Preface


This volume, first published in hardback in 1994, presents an overview of the foun- 
dations and key theoretical concepts of Bayesian Statistics. Our original intention 
had been to produce further volumes on computation and methods. However, these 
projects have been shelved as tailored Markov chain Monte Carlo methods have 
emerged and are being refined as the standard Bayesian computational tools. We 
have taken the opportunity provided by this reissue to make a number of typograph- 
ical corrections. 

The original motivation for this enterprise stemmed from the impact and in- 
fluence of de Finetti’s two-volume Theory of Probability, which one of us helped 
translate into English from the Italian in the early 1970’s. This was widely ac- 
knowledged as the definitive exposition of the operationalist, subjectivist approach 
to uncertainty, and provided further impetus at that time to a growth in activity and 
interest in Bayesian ideas. 

From a philosophical, foundational perspective, the de Finetti volumes provide 
—in the words of the author’s dedication to his friend Segre — 


... a necessary document for clarifying one point of view in its entirety. 


From a statistical, methodological perspective, however, the de Finetti vol- 
umes end abruptly, with just the barest introduction to the mechanics of Bayesian 
inference. 

Some years ago, we decided to try to write a series of books which would take 
up the story where de Finetti left off, with the grandiose objective of “clarifying in 
its entirety” the world of Bayesian statistical theory and practice. 




It is now clear that this was a hopeless undertaking. The world of Bayesian 
Statistics has been changing shape and growing in size rapidly and unpredictably — 
most notably in relation to developments in computational methods and the sub- 
sequent opening up of new application horizons. We are greatly relieved that we 
were too incompetent to finish our books a few years ago! 

And, of course, these changes and developments continue. There is no static 
world of Bayesian Statistics to describe in a once-and-for-all way. Moreover, we 
are dealing with a field of activity where, even among those whose intellectual
perspectives fall within the broad paradigm, there are considerable differences of 
view at the level of detail and nuance of interpretation. 

This volume on Bayesian Theory attempts to provide a fairly complete and 
up-to-date overview of what we regard as the key concepts, results and issues. 
However, it necessarily reflects the prejudices and interests of its authors — as well 
as the temporal constraints imposed by a publisher whose patience has been sorely 
tested for far too long. We can but hope that our sins of commission and omission 
are not too grievous. 

Too many colleagues have taught us too many things for it to be practical to 
list everyone to whom we are beholden. However, Dennis Lindley has played a 
special role, not least in supervising us as Ph.D. students, and we should like to 
record our deep gratitude to him. We also shared many enterprises with Morrie 
DeGroot and continue to miss his warmth and intellectual stimulation. For detailed 
comments on earlier versions of material in this volume, we are indebted to our 
colleagues M. J. Bayarri, J. O. Berger, J. de la Horra, P. Diaconis, F. J. Girón, M. A.
Gómez-Villegas, D. V. Lindley, M. Mendoza, J. Muñoz, E. Moreno, L. R. Pericchi,
A. van der Linde, C. Villegas and M. West.

We are also grateful, in more ways than one, to the State of Valencia. It has 
provided a beautiful and congenial setting for much of the writing of this book. 
And, in the person of the Governor, Joan Lerma, it has been wonderfully supportive
of the celebrated series of Valencia International Meetings on Bayesian Statistics. 
During the secondment of one of us as scientific advisor to the Governor, it also 
provided resources to enable the writing of this book to continue. 

This volume has been produced directly in TeX and we are grateful to Maria
Dolores Tortajada for all her efforts. 

Finally, we thank past and present editors at John Wiley & Sons for their 
support of this project: Jamie Cameron for saying “Go!” and Helen Ramsey for 
saying “Stop!” 


Valencia, Spain J. M. Bernardo 
January 26, 2000 A. F. M. Smith 


Contents 


1. INTRODUCTION
    1.1. Thomas Bayes
    1.2. The subjectivist view of probability
    1.3. Bayesian Statistics in perspective
    1.4. An overview of Bayesian Theory
        1.4.1. Scope
        1.4.2. Foundations
        1.4.3. Generalisations
        1.4.4. Modelling
        1.4.5. Inference
        1.4.6. Remodelling
        1.4.7. Basic formulae
        1.4.8. Non-Bayesian theories
    1.5. A Bayesian reading list


2. FOUNDATIONS
    2.1. Beliefs and actions
    2.2. Decision problems
        2.2.1. Basic elements
        2.2.2. Formal representation
    2.3. Coherence and quantification
        2.3.1. Events, options and preferences
        2.3.2. Coherent preferences
        2.3.3. Quantification
    2.4. Beliefs and probabilities
        2.4.1. Representation of beliefs
        2.4.2. Revision of beliefs and Bayes’ theorem
        2.4.3. Conditional independence
        2.4.4. Sequential revision of beliefs
    2.5. Actions and utilities
        2.5.1. Bounded sets of consequences
        2.5.2. Bounded decision problems
        2.5.3. General decision problems
    2.6. Sequential decision problems
        2.6.1. Complex decision problems
        2.6.2. Backward induction
        2.6.3. Design of experiments
    2.7. Inference and information
        2.7.1. Reporting beliefs as a decision problem
        2.7.2. The utility of a probability distribution
        2.7.3. Approximation and discrepancy
        2.7.4. Information
    2.8. Discussion and further references
        2.8.1. Operational definitions
        2.8.2. Quantitative coherence theories
        2.8.3. Related theories
        2.8.4. Critical issues


3. GENERALISATIONS
    3.1. Generalised representation of beliefs
        3.1.1. Motivation
        3.1.2. Countable additivity
    3.2. Review of probability theory
        3.2.1. Random quantities and distributions
        3.2.2. Some particular univariate distributions
        3.2.3. Convergence and limit theorems
        3.2.4. Random vectors, Bayes’ theorem
        3.2.5. Some particular multivariate distributions
    3.3. Generalised options and utilities
        3.3.1. Motivation and preliminaries
        3.3.2. Generalised preferences
        3.3.3. The value of information
    3.4. Generalised information measures
        3.4.1. The general problem of reporting beliefs
        3.4.2. The utility of a general probability distribution
        3.4.3. Generalised approximation and discrepancy
        3.4.4. Generalised information
    3.5. Discussion and further references
        3.5.1. The role of mathematics
        3.5.2. Critical issues


4. MODELLING
    4.1. Statistical models
        4.1.1. Beliefs and models
    4.2. Exchangeability and related concepts
        4.2.1. Dependence and independence
        4.2.2. Exchangeability and partial exchangeability
    4.3. Models via exchangeability
        4.3.1. The Bernoulli and binomial models
        4.3.2. The multinomial model
        4.3.3. The general model
    4.4. Models via invariance
        4.4.1. The normal model
        4.4.2. The multivariate normal model
        4.4.3. The exponential model
        4.4.4. The geometric model
    4.5. Models via sufficient statistics
        4.5.1. Summary statistics
        4.5.2. Predictive sufficiency and parametric sufficiency
        4.5.3. Sufficiency and the exponential family
        4.5.4. Information measures and the exponential family
    4.6. Models via partial exchangeability
        4.6.1. Models for extended data structures
        4.6.2. Several samples
        4.6.3. Structured layouts
        4.6.4. Covariates
        4.6.5. Hierarchical models
    4.7. Pragmatic aspects
        4.7.1. Finite and infinite exchangeability
        4.7.2. Parametric and nonparametric models
        4.7.3. Model elaboration
        4.7.4. Model simplification
        4.7.5. Prior distributions
    4.8. Discussion and further references
        4.8.1. Representation theorems
        4.8.2. Subjectivity and objectivity
        4.8.3. Critical issues


5. INFERENCE
    5.1. The Bayesian paradigm
        5.1.1. Observables, beliefs and models
        5.1.2. The role of Bayes’ theorem
        5.1.3. Predictive and parametric inference
        5.1.4. Sufficiency, ancillarity and stopping rules
        5.1.5. Decisions and inference summaries
        5.1.6. Implementation issues
    5.2. Conjugate analysis
        5.2.1. Conjugate families
        5.2.2. Canonical conjugate analysis
        5.2.3. Approximations with conjugate families
    5.3. Asymptotic analysis
        5.3.1. Discrete asymptotics
        5.3.2. Continuous asymptotics
        5.3.3. Asymptotics under transformations
    5.4. Reference analysis
        5.4.1. Reference decisions
        5.4.2. One-dimensional reference distributions
        5.4.3. Restricted reference distributions
        5.4.4. Nuisance parameters
        5.4.5. Multiparameter problems
    5.5. Numerical approximations
        5.5.1. Laplace approximation
        5.5.2. Iterative quadrature
        5.5.3. Importance sampling
        5.5.4. Sampling-importance-resampling
        5.5.5. Markov chain Monte Carlo
    5.6. Discussion and further references
        5.6.1. An historical footnote
        5.6.2. Prior ignorance
        5.6.3. Robustness
        5.6.4. Hierarchical and empirical Bayes
        5.6.5. Further methodological developments
        5.6.6. Critical issues


6. REMODELLING
    6.1. Model comparison
        6.1.1. Ranges of models
        6.1.2. Perspectives on model comparison
        6.1.3. Model comparison as a decision problem
        6.1.4. Zero-one utilities and Bayes factors
        6.1.5. General utilities
        6.1.6. Approximation by cross-validation
        6.1.7. Covariate selection
    6.2. Model rejection
        6.2.1. Model rejection through model comparison
        6.2.2. Discrepancy measures for model rejection
        6.2.3. Zero-one discrepancies
        6.2.4. General discrepancies
    6.3. Discussion and further references
        6.3.1. Overview
        6.3.2. Modelling and remodelling
        6.3.3. Critical issues


A. SUMMARY OF BASIC FORMULAE
    A.1. Probability distributions
    A.2. Inferential processes


B. NON-BAYESIAN THEORIES
    B.1. Overview
    B.2. Alternative approaches
        B.2.1. Classical decision theory
        B.2.2. Frequentist procedures
        B.2.3. Likelihood inference
        B.2.4. Fiducial and related theories
    B.3. Stylised inference problems
        B.3.1. Point estimation
        B.3.2. Interval estimation
        B.3.3. Hypothesis testing
        B.3.4. Significance testing
    B.4. Comparative issues
        B.4.1. Conditional and unconditional inference
        B.4.2. Nuisance parameters and marginalisation
        B.4.3. Approaches to prediction
        B.4.4. Aspects of asymptotics
        B.4.5. Model choice criteria


REFERENCES

SUBJECT INDEX

AUTHOR INDEX


BAYESIAN THEORY 
Edited by José M. Bernardo and Adrian F. M. Smith 
Copyright © 2000 by John Wiley & Sons, Ltd 


Chapter 1 


Introduction 


Summary 
A brief historical introduction to Bayes’ theorem and its author is given, as 
a prelude to a statement of the perspective adopted in this volume regarding 
Bayesian Statistics. An overview is provided of the material to be covered in 
successive chapters and appendices, and a Bayesian reading list is provided. 


1.1 THOMAS BAYES 


According to contemporary journal death notices and the inscription on his tomb 
in Bunhill Fields cemetery in London, Thomas Bayes died on 7th April, 1761, at 
the age of 59. The inscription on top of the tomb reads: 


Rev. Thomas Bayes. Son of the said Joshua and Ann Bayes (59). 7 April 1761. 
In recognition of Thomas Bayes’s important work in probability. The vault was 
restored in 1969 with contributions received from statisticians throughout the 
world. 


Definitive records of Bayes’ birth do not seem to exist, but, allowing for the
calendar reform of 1752 and accepting that he died at the age of 59, it seems likely
that he was born in 1701 (an argument attributed to Bellhouse in the Inst. Math.
Statist. Bull. 26, 1992). Some background on the life and the work of Bayes may
be found in Barnard (1958), Holland (1962), Pearson (1978), Gillies (1987), Dale
(1990, 1991) and Earman (1990). See, also, Stigler (1986a).

That his name lives on in the characterisation of a modern statistical method- 
ology is a consequence of the publication of An essay towards solving a problem in 
the doctrine of chances, attributed to Bayes and communicated to the Royal Society 
after Bayes’ death by Richard Price in 1763 (Phil. Trans. Roy. Soc. 53, 370-418).

The technical result at the heart of the essay is what we now know as Bayes’
theorem. However, from a purely formal perspective there is no obvious reason 
why this essentially trivial probability result should continue to excite interest. 

In its simplest form, if H denotes an hypothesis and D denotes data, the
theorem states that


P(H | D) = P(D | H) × P(H) / P(D).


With P(H) regarded as a probabilistic statement of belief about H before obtaining
data D, the left-hand side P(H | D) becomes a probabilistic statement of belief
about H after obtaining D. Having specified P(D | H) and P(D), the mechanism
of the theorem provides a solution to the problem of how to learn from data. 
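
As a minimal numerical sketch of this mechanism, the following Python fragment applies the theorem to a single hypothesis H and its negation. The function name `posterior` and the input values are illustrative assumptions, not taken from the text; P(D) is obtained here from the law of total probability over H and not-H.

```python
def posterior(prior_h, lik_d_given_h, lik_d_given_not_h):
    """Return P(H | D) from P(H), P(D | H) and P(D | not H)."""
    # P(D) by the law of total probability over H and its negation.
    p_d = lik_d_given_h * prior_h + lik_d_given_not_h * (1.0 - prior_h)
    if p_d == 0.0:
        raise ValueError("P(D) is zero, so the posterior is undefined")
    # Bayes' theorem: P(H | D) = P(D | H) x P(H) / P(D).
    return lik_d_given_h * prior_h / p_d


# Hypothetical inputs: prior belief 0.01 in H, with the data nine times
# more probable under H (0.9) than under its negation (0.1).
print(posterior(0.01, 0.9, 0.1))
```

With these assumed inputs the belief in H rises from 0.01 to 1/12, which shows how strongly the prior still weighs on the result even when the data favour H.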

Actually, Bayes only stated his result for a uniform prior. According to Stigler
(1986b), it was Laplace (1774/1986), apparently unaware of Bayes’ work, who
stated the theorem in its general (discrete) form.

Like any theorem in probability, at the technical level Bayes’ theorem merely 
provides a form of “uncertainty accounting”, which asserts that the left-hand side of 
the equation must equal the right-hand side. The interest and controversy, of course, 
lie in the interpretation and assumed scope of the formal inputs to the two sides of 
the equation — and it is here that past and present commentators part company in 
their responses to the idea that Bayes’ theorem can or should be regarded as a central 
feature of the statistical learning process. At the heart of the controversy is the issue 
of the philosophical interpretation of probability — objective or subjective? — and 
the appropriateness and legitimacy of basing a scientific theory on the latter. 

What Thomas Bayes — from the tranquil surroundings of Bunhill Fields, where 
he lies in peace with Richard Price for company — has made of all the fuss over the 
last 233 years we shall never know. We would like to think that he is a subjectivist 
fellow-traveller but, in any case, he is in no position to complain at the liberties we 
are about to take in his name. 


1.2 THE SUBJECTIVIST VIEW OF PROBABILITY 


Throughout this work, we shall adopt a wholehearted subjectivist position regarding 
the interpretation of probability. The definitive account and defence of this position 
are given in de Finetti’s two-volume Theory of Probability (1970/1974, 1970/1975) 
and the following brief extract from the Preface to that work perfectly encapsulates
the essence of the case. 


The only relevant thing is uncertainty—the extent of our own knowledge and 
ignorance. The actual fact of whether or not the events considered are in some 
sense determined, or known by other people, and so on, is of no consequence.

The numerous, different, opposed attempts to put forward particular points 
of view which, in the opinion of their supporters, would endow Probability Theory
with a ‘nobler’ status, or a ‘more scientific’ character, or ‘firmer’ philosophical 
or logical foundations, have only served to generate confusion and obscurity, and 
to provoke well-known polemics and disagreements —even between supporters 
of essentially the same framework. 

The main points of view that have been put forward are as follows. 

The classical view, based on physical considerations of symmetry, in which 
one should be obliged to give the same probability to such ‘symmetric’ cases. But 
which symmetry? And, in any case, why? The original sentence becomes mean- 
ingful if reversed: the symmetry is probabilistically significant, in someone's 
opinion, if it leads him to assign the same probabilities to such events. 

The logical view is similar, but much more superficial and irresponsible
inasmuch as it is based on similarities or symmetries which no longer derive
from the facts and their actual properties, but merely from the sentences which 
describe them, and from their formal structure or language. 

The frequentist (or statistical) view presupposes that one accepts the clas- 
sical view, in that it considers an event as a class of individual events, the latter 
being ‘trials’ of the former. The individual events not only have to be ‘equally 
probable’, but also ‘stochastically independent’ . . . (these notions when applied 
to individual events are virtually impossible to define or explain in terms of the 
frequentist interpretation). In this case, also, it is straightforward, by means of 
the subjective approach, to obtain, under the appropriate conditions, in a perfectly 
valid manner, the result aimed at (but unattainable) in the statistical formulation. 
It suffices to make use of the notion of exchangeability. The result, which acts as 
a bridge connecting the new approach with the old, has often been referred to by 
the objectivists as “de Finetti’s representation theorem”. 

It follows that all the three proposed definitions of ‘objective’ probability, 
although useless per se, turn out to be useful and good as valid auxiliary devices 
when included as such in the subjectivist theory. 

(de Finetti, 1970/1974, Preface, xi—xii) 


1.3 BAYESIAN STATISTICS IN PERSPECTIVE


The theory and practice of Statistics span a range of diverse activities, which are 
motivated and characterised by varying degrees of formal intent. Activity in the 
context of initial data exploration is typically rather informal; activity relating to 
concepts and theories of evidence and uncertainty is somewhat more formally
structured; and activity directed at the mathematical abstraction and rigorous analysis
of these structures is intentionally highly formal. 


What is the nature and scope of Bayesian Statistics within this spectrum of 
activity? 

Bayesian Statistics offers a rationalist theory of personalistic beliefs in contexts 
of uncertainty, with the central aim of characterising how an individual should act 
in order to avoid certain kinds of undesirable behavioural inconsistencies. The 
theory establishes that expected utility maximisation provides the basis for rational 
decision making and that Bayes’ theorem provides the key to the ways in which 
beliefs should fit together in the light of changing evidence. The goal, in effect,
is to establish rules and procedures for individuals concerned with disciplined 
uncertainty accounting. The theory is not descriptive, in the sense of claiming to 
model actual behaviour. Rather, it is prescriptive, in the sense of saying “if you 
wish to avoid the possibility of these undesirable consequences you must act in the 
following way”. 


From the very beginning, the development of the theory necessarily presumes 
a rather formal frame of discourse, within which uncertain events and available
actions can be described and axioms of rational behaviour can be stated. But this 
formalism is preceded and succeeded in the scientific learning cycle by activities 
which, in our view, cannot readily be seen as part of the formalism.


In any field of application, a prerequisite for arriving at a structured frame of
discourse will typically be an informal phase of exploratory data analysis. Also, it
can happen that evidence arises which discredits a previously assumed and accepted
formal framework and necessitates a rethink. Part of the process of realising that
a change is needed can take place within the currently accepted framework using
Bayesian ideas, but the process of rethinking is again outside the formalism. Both
these phases of initial structuring and subsequent restructuring might well be guided
by “Bayesian thinking” (by which we mean keeping in mind the objective of
creating or re-creating a formal framework for uncertainty analysis and decision
making) but are not themselves part of the Bayesian formalism. That said, there
is, of course, often a pragmatic ambiguity about the boundaries of the formal and
the informal.


The emphasis in this book is on ideas and we have sought throughout to keep 
the level of the mathematical treatment as simple as is compatible with giving what 
we regard as an honest account. However, there are sections where the full story 
would require a greater level of abstraction than we have adopted, and we have 
drawn attention to this whenever appropriate. 


1.4 AN OVERVIEW OF BAYESIAN THEORY


1.4.1 Scope


This volume on Bayesian Theory focuses on the basic concepts and theory of 
Bayesian Statistics, with chapters covering elementary Foundations, mathematical 
Generalisations of the Foundations, Modelling, Inference and Remodelling. In 
addition, there are two appendices providing a Summary of Basic Formulae and a 
review of Non-Bayesian Theories. The emphasis throughout is on general ideas
(the Why?) of Bayesian Statistics. A detailed treatment of analytical and numerical
techniques for implementing Bayesian procedures (the How?) will be provided in
the volume Bayesian Computation. A systematic study of the methods of analysis
for a wide range of commonly encountered model and problem types (the What?)
will be provided in the volume Bayesian Methods.

The selection of topics and the details of approach adopted in this volume 
necessarily reflect our own preferences and prejudices. Where we hold strong 
views, these are, for the most part, rather clearly and forcefully stated, while, 
hopefully, avoiding too dogmatic a tone. We acknowledge, however, that even 
colleagues who are committed to the Bayesian paradigm will disagree with at least 
some points of detail and emphasis in our account. For this reason, and to avoid 
complicating the main text with too many digressionary asides and references, each 
of Chapters 2 to 6 concludes with a Discussion and Further References section, in 
which some of the key issues in the chapter are critically re-examined. 

In most cases, the omission of a topic, or its abbreviated treatment in this 
volume, reflects the fact that a detailed treatment will be given in one or other of 
the volumes Bayesian Computation and Bayesian Methods. Topics falling into this 
category include Design of Experiments, Image Analysis, Linear Models, Multivari- 
ate Analysis, Nonparametric Inference, Prior Elicitation, Robustness, Sequential 
Analysis, Survival Analysis and Time Series. However, there are important topics, 
such as Game Theory and Group Decision Making, which are omitted simply be- 
cause a proper treatment seemed to us to involve too much of a digression from our 
central theme. For a convenient source of discussion and references at the interface 
of Decision Theory and Game Theory, see French (1986). 


1.4.2 Foundations 


In Chapter 2, the concept of rationality is explored in the context of representing 
beliefs or choosing actions in situations of uncertainty. We introduce a formal 
framework for decision problems and an axiom system for the foundations of 
decision theory, which we believe to have considerable intuitive appeal and to be 
an improvement on the many such systems that have been previously proposed. 
Here, and throughout this volume, we stress the importance of a decision-oriented 
framework in providing a disciplined setting for the discussion of issues relating to
uncertainty and rationality. 

The dual concepts of probability and utility are formally defined and analysed 
within this decision making context and the criterion of maximising expected utility 
is shown to be the only decision criterion which is compatible with the axiom sys- 
tem. The analysis of sequential decision problems is shown to reduce to successive 
applications of the methodology introduced. 

A key feature of our approach is that statistical inference is viewed simply as 
a particular form of decision problem; specifically, a decision problem where an 
action corresponds to reporting a probability belief distribution for some unknown 
quantity of interest. Thus defined, the inference problem can be analysed within 
the general decision theory framework, rather than requiring a separate “theory of 
inference”. 

An important special feature of what we shall call a pure inference problem is 
the form of utility function to be adopted. We establish that the logarithmic utility 
function — more often referred to as a score function in this context — plays a special 
role as the natural utility function for describing the preferences of an individual 
faced with a pure inference problem. 

Within this framework, measures of the discrepancy between probability dis- 
tributions and the amount of information contained in a distribution are naturally 
defined in terms of expected loss and expected increase, respectively, in loga- 
rithmic utility. These measures are mathematically closely related to well-known 
information-theoretic measures pioneered by Shannon (1948) and employed in 
statistical contexts by Kullback (1959/1968). A resulting characteristic feature of 
our approach is therefore the systematic appearance of these information-theoretic 
quantities as key elements in the Bayesian analysis of inference and general decision 
problems. 
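
The link between logarithmic utility and information-theoretic quantities can be sketched numerically. The fragment below assumes discrete distributions on a common finite set and an unscaled logarithmic score; the function names are hypothetical and the formal definitions are those of Chapter 2, not this sketch. Under these assumptions, the expected loss in score from reporting q when one's beliefs are p coincides with the Kullback-Leibler divergence of q from p.

```python
import math


def expected_log_score(p, q):
    """Expected logarithmic score of reporting q when beliefs are p.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


def discrepancy(p, q):
    """Expected loss in log score from reporting q instead of p."""
    # Equal to sum_i p_i * log(p_i / q_i), the Kullback-Leibler divergence.
    return expected_log_score(p, p) - expected_log_score(p, q)


# Hypothetical beliefs p and an alternative report q.
p = [0.5, 0.5]
q = [0.25, 0.75]
print(discrepancy(p, q))
```

The discrepancy is zero when q equals p and positive otherwise, so under this score an individual cannot expect to gain by reporting anything other than the beliefs actually held.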


1.4.3 Generalisations 


In Chapter 3, the ideas and results of Chapter 2 are extended to a much more 
general mathematical setting. An additional postulate concerning the comparison 
of a countable collection of events is appended to the axiom system of Chapter 2,
and is shown to provide a justification for restricting attention to countably additive 
probability as the basis for representing beliefs. The elements of mathematical 
probability theory required in our subsequent development are then reviewed. 

The notions of actions and utilities, introduced in a simple discrete setting in 
Chapter 2, are extended in a natural way to provide a very general mathematical 
framework for our development of decision theory. A further additional mathematical
postulate regarding preferences is introduced and, within this more general
framework, the criterion of maximising expected utility is shown to be the only 
decision making criterion compatible with the extended axiom system. 

In this generalised setting, inference problems are again considered simply as
special cases of decision problems and generalised definitions of score functions 
and measures of information and discrepancy are given. 


1.4.4 Modelling 


In Chapter 4, we examine in detail the role of familiar mathematical forms of sta- 
tistical models and the possible justifications —from a subjectivist perspective — 
for their use as representations of actual beliefs about observable random quanti- 
ties. A feature of our approach is an emphasis on the primacy of observables and 
the notion of a model as a (probabilistic) prediction device for such observables. 
From this perspective, the role of conventional parametric statistical modelling is 
problematic, and requires fundamental re-examination. 

The problem is approached by considering simple structural characteristics 
—such as symmetry with respect to the labelling of individual counts or measure- 
ments, a feature common to many individual beliefs about sequences of observables. 
The key concept here is that of exchangeability, which we motivate, formalise and 
then use to establish a version of de Finetti’s celebrated representation theorem. 
This demonstrates that judgements of exchangeability lead to general mathematical 
representations of beliefs that justify and clarify the use and interpretations of such 
familiar statistical concepts as parameters, random samples, likelihoods and prior 
distributions. 
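
The easy direction of this connection can be checked numerically for 0/1 observables: if beliefs about a sequence are a mixture of Bernoulli models over an unknown parameter, the probability of any particular sequence depends only on how many ones it contains, not on their order. The sketch below assumes a Beta(a, b) mixing distribution purely for tractability; that choice and the function names are illustrative assumptions. The representation theorem itself runs in the harder, converse direction.

```python
import math
from itertools import permutations


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def sequence_probability(xs, a=1.0, b=1.0):
    """P(x_1, ..., x_n) for 0/1 outcomes under a Beta(a, b) mixture of Bernoulli models."""
    n, s = len(xs), sum(xs)
    # Integrating theta^s (1 - theta)^(n - s) against the Beta(a, b) density
    # gives B(a + s, b + n - s) / B(a, b), which depends on xs only through (n, s).
    return math.exp(log_beta(a + s, b + n - s) - log_beta(a, b))


# Every reordering of the same outcomes receives the same probability.
seq = (1, 1, 0, 1)
probs = {sequence_probability(p) for p in set(permutations(seq))}
print(probs)
```

Note that the observations are exchangeable without being independent: under the uniform mixing distribution assumed here, the probability of two successive ones is 1/3, not the square of 1/2.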

Going beyond simple exchangeability, we show that beliefs which have cer- 
tain additional invariance properties — for example, to rotation of the axes of mea- 
surements, or translation of the origin—can lead to mathematical representations 
involving other familiar specific forms of parametric distributions, such as normals 
and exponentials. 

A further approach to characterising belief distributions is considered, based 
on data reduction. The concept of a sufficient statistic is introduced and related to 
representations involving the exponential family of distributions. 

Various forms of partial exchangeability judgements about data structures are 
then discussed in a number of familiar contexts and links are established with a 
number of other commonly used statistical models. Structures considered include 
those of several samples, multiway layouts, problems involving covariates, and 
hierarchies. 


1.4.5 Inference 


In Chapter 5, the key role of Bayes’ theorem in the updating of beliefs about observables
in the light of new information is identified and related to conventional
mechanisms of predictive and parametric inference. The roles of sufficiency, an- 
cillarity and stopping rules in such inference processes are also examined. 

Various standard forms of statistical problems, such as point and interval
estimation and hypothesis testing, are re-examined within the general Bayesian 
decision framework and related to formal and informal inference summaries. 

The problems of implementing Bayesian procedures are discussed at length. 
The mathematical convenience and elegance of conjugate analysis are illustrated in 
detail, as are the mathematical approximations available under the assumption of 
the validity of large-sample asymptotic analysis. A particular feature of this volume 
is the extended account of so-called reference analysis, which can be viewed as a 
Bayesian formalisation of the idea of “letting the data speak for themselves”. An 
alternative, closely related idea is that of how to represent “vague beliefs” or “ignorance”.
We provide a detailed historical review of attempts that have been made to
solve this problem and compare and contrast some of these with the reference anal- 
ysis approach. A brief account is given of recent analytic approximation strategies 
derived from Laplace-type methods, together with outline accounts of numerical 
quadrature, importance sampling, sampling-importance-resampling, and Markov
chain Monte Carlo methods. 
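
The convenience of conjugate analysis mentioned above can be illustrated with the simplest standard case, a Beta prior for a Bernoulli success probability, where updating reduces to adding counts to the prior parameters. The sketch below is a minimal illustration under that assumption; the function names and the data are hypothetical and it is not a substitute for the canonical treatment in Chapter 5.

```python
def update_beta(a, b, successes, failures):
    """Posterior Beta parameters after Bernoulli data under a Beta(a, b) prior."""
    # Conjugacy: the posterior is again a Beta distribution, with the observed
    # counts simply added to the prior parameters.
    return a + successes, b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical data: 7 successes and 3 failures, starting from a uniform prior.
a_post, b_post = update_beta(1.0, 1.0, successes=7, failures=3)
print(a_post, b_post, beta_mean(a_post, b_post))

# Updating in two stages gives the same posterior as updating once.
a_seq, b_seq = update_beta(*update_beta(1.0, 1.0, 4, 1), 3, 2)
print((a_seq, b_seq) == (a_post, b_post))
```

Because the posterior depends on the data only through the two counts, processing the observations in one batch or in several stages leads to the same result.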


1.4.6 Remodelling 


In Chapter 6, it is argued that, whether viewed from the perspective of a sensitive 
individual modeller or from that of a group of modellers, there are good reasons for
systematically entertaining a range of possible belief models, rather than predicating 
all analysis on a single assumed model. 

A variety of decision problems are examined within this framework, some 
involving model choice only, some involving model choice followed by a terminal 
action, such as prediction, others involving only a terminal action. 

A feature of our treatment of this topic is that, throughout, a clear distinction
is drawn among three rather different perspectives on the comparison of and
choice from among a range of competing models. The first perspective arises when
the range of models under consideration is assumed to include the “true” model.
The second perspective arises when the range of models is assumed to be under
consideration in order to provide a more conveniently implemented proxy for an
actual, but intractable, belief model. The third perspective arises when the range
of models is under consideration because the models are “all there is available”, in
the absence of any specification of an actual belief model. Our discussion relates
and links these ideas with aspects of hypothesis testing, significance testing and
cross-validation.


1.4.7 Basic Formulae


In Appendix A, we collect together for convenience, in tabular format, summaries
of the main univariate and multivariate probability distributions that appear in the
text, together with summaries of the prior/posterior/predictive forms corresponding
to these distributions in the context of conjugate and reference analyses. 


1.4.8 Non-Bayesian Theories 


In Appendix B, we review what we perceive to be the main alternatives to the 
Bayesian approach; namely, classical decision theory, frequentist procedures, like- 
lihood theory, and fiducial and related theories. 

We compare and contrast these alternatives in the context of “stylised” infer- 
ence problems such as point and interval estimation, hypothesis and significance 
testing. Through counter-examples and general discussion, we indicate why we 
find all these alternatives seriously deficient as formal inference theories. 


1.5 A BAYESIAN READING LIST 


As we have already remarked, this work is—necessarily —a selective account of 
Bayesian theory, reflecting our own interests and perspectives. The following is 
a list of other Bayesian books —by no means exhaustive— whose contents would 
provide a significant complement to the material in this volume. 

In those cases where there are several editions, or when the original is not in
English, we quote both the original date and the date of the most recent English
edition. Thus, Jeffreys (1939/1961) refers to Jeffreys’ Theory of Probability, first
published in 1939, and to its most recent (3rd) edition, published in 1961; similarly,
de Finetti (1970/1974) refers to the original (1970) Italian version of de Finetti’s
Teoria delle Probabilità vol. 1 and to its English translation (published in 1974).

Pioneering Bayesian books include Laplace (1812), Keynes (1921/1929), Jef- 
freys (1939/1961), Good (1950, 1965), Savage (1954/1972, 1962), Schlaifer (1959, 
1961), Raiffa and Schlaifer (1961), Mosteller and Wallace (1964/1984), Dubins and 
Savage (1965/1976), Lindley (1965, 1972), Pratt et al. (1965), Tribus (1969), De- 
Groot (1970), de Finetti (1970/1974, 1970/1975, 1972) and Box and Tiao (1973). 

Elementary and intermediate Bayesian textbooks include those of Savage
(1968), Schmitt (1969), LaValle (1970), Lindley (1971/1985), Winkler (1972),
Kleiter (1980), Bernardo (1981b), Daboni and Wedlin (1982), Iversen (1984),
O’Hagan (1988a), Cifarelli and Muliere (1989), Lee (1989), Press (1989),
Scozzafava (1989), Wichmann (1990), Borovcnik (1992), and Berry (1996).

More advanced Bayesian monographs include Hartigan (1983), Regazzini
(1983), Berger (1985a), Savchuk (1989), Florens et al. (1990), Robert (1992) and
O’Hagan (1994a). Polson and Tiao (1995) is a two-volume collection of classic
papers in Bayesian inference.

Special topics have also been examined from a Bayesian point of view; these
include Actuarial Science (Klugman, 1992), Biostatistics (Girelli-Bruni, 1981;
Lecoutre, 1984; Barrai et al., 1992; Berry and Stangl, 1996), Control Theory
(Aoki, 1967; Sawagari et al., 1967), Decision Analysis (Duncan and Raiffa, 1957;
Chernoff and Moses, 1959; Grayson, 1960; Fellner, 1965; Roberts, 1966;
Edwards and Tversky, 1967; Hadley, 1967; Martin, 1967; Morris, 1968; Raiffa, 1968;
Lusted, 1968; Schlaifer, 1969; Aitchison, 1970; Fishburn, 1970, 1982; Halter
and Dean, 1971; Lindgren, 1971; Keeney and Raiffa, 1976; Rios, 1977; LaValle,
1978; Roberts, 1979; French et al., 1983; French, 1986, 1989; Marinell and Seeber,
1988; Smith, 1988a), Dynamic Forecasting (Spall, 1988; West and Harrison, 1989;
Pole et al., 1994), Economics and Econometrics (Morales, 1971; Zellner, 1971;
Richard, 1973; Bauwens, 1984; Boyer and Kihlstrom, 1984; Cyert and DeGroot,
1987), Educational and Psychological Research (Novick and Jackson, 1974;
Pollard, 1986), Foundations (Fishburn, 1964, 1970, 1982, 1987, 1988a; Berger and
Wolpert, 1984/1988; Brown, 1985), History (Dale, 1991), Information Theory
(Yaglom and Yaglom, 1960/1983; Osteyee and Good, 1974), Law and Forensic Science
(DeGroot et al., 1986; Aitken and Stoney, 1991), Linear Models (Lempers, 1971;
Leamer, 1978; Pilz, 1983/1991; Broemeling, 1985), Logic and Philosophy of
Science (Jeffrey, 1965/1983, 1981; Rosenkranz, 1977; Seidenfeld, 1979; Howson and
Urbach, 1989; Verbraak, 1990; Rivadulla, 1991), Maximum Entropy (Levine and
Tribus, 1978; Smith and Grandy, 1985; Justice, 1987; Smith and Erickson, 1987;
Erickson and Smith, 1988; Skilling, 1989; Fougére, 1990; Grandy and Schick, 1991;
Kapur and Kesavan, 1992; Mohammad-Djafari and Demoment, 1993), Multivariate
Analysis (Press, 1972/1982; Berger and DasGupta, 1991), Optimisation (Mockus,
1989), Pattern Recognition (Simon, 1984), Prediction (Aitchison and Dunsmore,
1975; Geisser, 1993), Probability Assessment (Stael von Holstein, 1970; Stael von
Holstein and Matheson, 1979; Cooke, 1991), Reliability (Martz and Waller, 1982;
Claroti and Lindley, 1988), Sample Surveys (Rubin, 1987), Social Science (Phillips,
1973) and Spectral Analysis (Bretthorst, 1988).

A number of collected works also include a wealth of Bayesian material.
Among these, we note particularly: Kyburg and Smokler (1964/1980), Meyer and
Collier (1970), Godambe and Sprott (1971), Fienberg and Zellner (1974), White and
Bowen (1975), Aykac and Brumat (1977), Parenti (1978), Zellner (1980), Savage
(1981), Box et al. (1983), Dawid and Smith (1983), Florens et al. (1983, 1985),
Good (1983), Jaynes (1983), Kadane (1984), Box (1985), Goel and Zellner (1986),
Smith and Dawid (1987), Viertl (1987), Gärdenfors and Sahlin (1988), Gupta and
Berger (1988, 1994), Geisser et al. (1990), Hinkelmann (1990), Oliver and Smith
(1990), Ghosh and Pathak (1992), Goel and Iyengar (1992), de Finetti (1993), Fearn
and O’Hagan (1993), Gatsonis et al. (1993), Freeman and Smith (1994), and last,
but not least, the Proceedings of the Valencia International Meetings on Bayesian
Statistics (Bernardo et al., 1980, 1985, 1988, 1992, 1996 and 1999).

General discussions of Bayesian Statistics may be found in review papers and
encyclopedia articles. Among these, we note de Finetti (1951), Lindley (1953,
1976, 1978, 1982b, 1982c, 1984, 1990, 1992), Anscombe (1961), Savage (1961,
1970), Edwards et al. (1963), Bartholomew (1965), Cornfield (1969), Good (1976,
1992), Roberts (1978), DeGroot (1982), Dawid (1983a), Smith (1984), Zellner
(1985, 1987, 1988a), Pack (1986a, 1986b), Cifarelli (1987), Bernardo (1989),
Ghosh (1991) and Berger (1993).

For discussion of important specific topics, see Luce and Suppes (1965), Birn- 
baum (1968, 1978), de Finetti (1968), Press (1980a, 1985a), Fishburn (1981, 1986, 
1988b), Dickey (1982), Geisser (1982, 1986), Good (1982, 1985, 1987, 1988a, 
1988b), Dawid (1983b, 1986a, 1992), Joshi (1983), LaMotte (1985), Genest and 
Zidek (1986), Goldstein (1986c), Racine-Poon et al. (1986), Hodges (1987), Eric- 
son (1988), Zellner (1988c), Trader (1989), Breslow (1990), Lindley (1991), Smith 
(1991), Barlow and Irony (1992), Ferguson et al. (1992), Arnold (1993), Kadane
(1993), Bartholomew (1994), Berger (1994) and Hill (1994).

Chapter 2


Foundations 


Summary 


The concept of rationality is explored in the context of representing beliefs or 
choosing actions in situations of uncertainty. An axiomatic basis, with intuitive 
operational appeal, is introduced for the foundations of decision theory. The dual 
concepts of probability and utility are formally defined and analysed within this 
context. The criterion of maximising expected utility is shown to be the only 
decision criterion which is compatible with the axiom system. The analysis of 
sequential decision problems is shown to reduce to successive applications of the 
methodology introduced. Statistical inference is viewed as a particular decision 
problem which may be analysed within the framework of decision theory. The 
logarithmic score is established as the natural utility function to describe the 
preferences of an individual faced with a pure inference problem. Within this 
framework, the concept of discrepancy between probability distributions and the 
quantification of the amount of information in new data are naturally defined in 
terms of expected loss and expected increase in utility, respectively. 


2.1 BELIEFS AND ACTIONS


We spend a considerable proportion of our lives, both private and professional, in 
a state of uncertainty. This uncertainty may relate to past situations, where direct
knowledge or evidence is not available, or has been lost or forgotten; or to present
and future developments which are not yet completed. Whatever the circumstances, 
there is a sense in which all states of uncertainty may be described in the same way: 
namely, an individual feeling of incomplete knowledge in relation to a specified 
situation (a feeling which may, of course, be shared by other individuals). And yet 
it is obvious that we do not attempt to treat all our individual uncertainties with the 
same degree of interest or seriousness. 

Many feelings of uncertainty are rather insubstantial and we neither seek to 
analyse them, nor to order our thoughts and opinions in any kind of responsible 
way. This typically happens when we feel no actual or practical involvement with 
the situation in question. In other words, when we feel that we have no (or only 
negligible) capacity to influence matters, or that the possible outcomes have no (or 
only negligible) consequences so far as we are concerned. In such cases, we are not
motivated to think carefully about our uncertainty either because nothing depends 
on it, or the potential effects are trivial in comparison with the effort involved in 
carrying out a conscious analysis. 

On the other hand, we all regularly encounter uncertain situations in which 
we at least aspire to behave “rationally” in some sense. This might be because we 
face the direct practical problem of choosing from among a set of possible actions,
where each involves a range of uncertain consequences and we are concerned to 
avoid making an “illogical” choice. Alternatively, we might be called upon to 
summarise our beliefs about the uncertain aspects of the situation, bearing in mind 
that others may subsequently use this summary as the basis for choosing an action. 
In this case, we are concerned that our summary be in a form which will enable 
a “rational” choice to be made at some future time. More specifically, we might
regard the summary itself, i.e., the choice of a particular mode of representing and
communicating our beliefs, as being a form of action to which certain criteria of
“rationality” might be directly applied. 

Our basic concern in this chapter is with exploring the concept of “rationality” 
in the context of representing beliefs or choosing actions in situations of uncertainty. 
To choose the best among a set of actions would, in principle, be immediate if we
had perfect information about the consequences to which they would lead. So far 
as this work is concerned, interesting decision problems are those for which such 
perfect information is not available, and we must take uncertainty into account as 
a major feature of the problem. 

It might be argued that there are complex situations where we do have complete 
information and yet still find it difficult to take the best decision. Here, however,
the difficulty is technical, not conceptual. For example, even though we have,
in principle, complete information, it is typically not easy to decide what is the
optimal strategy to rebuild a Rubik cube or which is the cheapest diet fulfilling
specified nutritional requirements. We take the view that such problems are purely
technical. In the first case, they result from the large number of possible strategies;
in the second, they reduce to the mathematical problem of finding a minimum
under certain constraints. But in neither case is there any doubt about the decision 
criterion to be used. In this work we shall not consider these kinds of combinatorial 
or mathematical programming problems, and we shall assume that in the presence 
of complete information we can, in principle, always choose the best alternative. 

Our concern, instead, is with the logical process of decision making in sit- 
uations of uncertainty. In other words, with the decision criterion to be adopted 
when we do not have complete information and are thus faced with, at least some, 
elements of uncertainty. 

To avoid any possible confusion, we should emphasise that we do not interpret 
“actions in situations of uncertainty” in a narrow, directly “economic” sense. For 
example, within our purview we include the situation of an individual scientist 
summarising his or her own current beliefs following the results of an experiment; 
or trying to facilitate the task of others seeking to decide upon their beliefs in the 
light of the experimental results. 

It is assumed in our approach to such problems that the notion of “rational
belief” cannot be considered separately from the notion of “rational action”. Either 
a statement of beliefs in the light of available information is, actually or potentially, 
an input into the process of choosing some practical course of action, 


... it is not asserted that a belief ... does actually lead to action, but would lead
to action in suitable circumstances; just as a lump of arsenic is called poisonous 
not because it actually has killed or will kill anyone, but because it would kill 
anyone if he ate it (Ramsey, 1926). 


or, alternatively, a statement of beliefs might be regarded as an end in itself, in 
which case the choice of the form of statement to be made constitutes an action, 


Frequently, it is a question of providing a convenient summary of the data... 
In such cases, the emphasis is on the inference rather than the decision aspect of
the problem, although formally it can still be considered a decision problem if the
inferential statement itself is interpreted as the decision to be taken (Lehmann, 
1959/1986). 


We can therefore explore the notion of “rationality” for both beliefs and actions 
by concentrating on the latter and asking ourselves what kinds of rules should govern 
preference patterns among sets of alternative actions in order that choices made in 
accordance with such rules commend themselves to us as “rational”, in that they 
cannot lead us into forms of behavioural inconsistency which we specifically wish 
to avoid. 

In Section 2.2, we describe the general structure of problems involving choices 
under uncertainty and introduce the idea of preferences between options. In
Section 2.3, we make precise the notion of “rational” preferences in the form of axioms.
We describe these as principles of quantitative coherence because they specify the
ways in which preferences need to be made quantitatively precise and fit together, 
or cohere, if “illogical” forms of behaviour are to be avoided. In Sections 2.4 and 
2.5, we prove that, in order to conform with the principles of quantitative coherence, 
degrees of belief about uncertain events should be described in terms of a (finitely 
additive) probability measure, relative values of individual possible consequences 
should be described in terms of a utility function, and the rational choice of an
action is to select one which has the maximum expected utility. 

In Section 2.6, we discuss sequential decision problems and show that their 
analysis reduces to successive applications of the maximum expected utility
methodology; in particular, we identify the design of experiments as a particular case
of a sequential decision problem. In Section 2.7, we make precise the sense in
which choosing a form of a statement of beliefs can be viewed as a special case
of a decision problem. This identification of inference as decision provides the
fundamental justification for beginning our development of Bayesian Statistics with
the discussion of decision theory. Finally, a general review of ideas and references
is given in Section 2.8. 


2.2 DECISION PROBLEMS 


2.2.1 Basic Elements 


We shall describe any situation in which choices are to be made among alternative
courses of action with uncertain consequences as a decision problem, whose
structure is determined by three basic elements:


(i) a set $\{a_i, i \in I\}$ of available actions, one of which is to be selected;

(ii) for each action $a_i$, a set $\{E_j, j \in J\}$ of uncertain events, describing the
uncertain outcomes of taking action $a_i$;

(iii) corresponding to each set $\{E_j, j \in J\}$, a set of consequences $\{c_j, j \in J\}$.

The idea is as follows. Suppose we choose action $a_i$; then one and only one of
the uncertain events $E_j$, $j \in J$, occurs and leads to the corresponding consequence
$c_j$, $j \in J$. Each set of events $\{E_j, j \in J\}$ forms a partition (an exclusive and
exhaustive decomposition) of the total set of possibilities. Naturally, both the set
of consequences and the partition which labels them may depend on the particular
action considered, so that a more precise notation would be $\{E_{ij}, j \in J_i\}$ and
$\{c_{ij}, j \in J_i\}$ for each action $a_i$. However, to simplify notation, we shall omit this
dependence, while remarking that it should always be borne in mind. We shall
come back to this point in Section 2.6.

In practical problems, the labelling sets, $I$ and $J$ (for each $i$), are typically
finite. In such cases, the decision problem can be represented schematically by 
means of a decision tree as shown in Figure 2.1.

Figure 2.1 Decision tree


The square represents a decision node, where the choice of an action is re- 
quired. The circle represents an uncertainty node where the outcome is beyond our 
control. Following the choice of an action and the occurrence of a particular event, 
the branch leads us to the corresponding consequence. 

Of course, most practical problems involve sequential considerations but, as 
shown in Section 2.6, these reduce, essentially, to repeated analyses based on the 
above structure. 

It is clear, either from our general discussion, or from the decision tree
representation, that we can formally identify any $a_i$, $i \in I$, with the combination of
$\{E_j, j \in J\}$ and $\{c_j, j \in J\}$ to which it leads. In other words, to choose $a_i$ is
to opt for the uncertain scenario labelled by the pairs $(E_j, c_j)$, $j \in J$. We shall
write $a_i = \{c_j \mid E_j, j \in J\}$ to denote this identification, where the notation $c_j \mid E_j$
signifies that event $E_j$ leads to consequence $c_j$, i.e., that $a_i(E_j) = c_j$.
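
As an informal illustration only, and not part of the formal development, the identification of an action with a function from the events of a partition to consequences can be sketched in a few lines of Python. The umbrella example, the names `OMEGA`, `actions`, `is_partition` and `consequence`, and the representation of events as sets of elementary possibilities are all assumptions introduced here for the sake of the sketch.

```python
from typing import Dict, FrozenSet, Iterable

Event = FrozenSet[str]       # an event: a set of elementary possibilities
Action = Dict[Event, str]    # an action: maps each event E_j to a consequence c_j

# Hypothetical total set of possibilities
OMEGA: Event = frozenset({"rain", "drizzle", "dry"})


def is_partition(events: Iterable[Event], omega: Event) -> bool:
    """True if the events are non-empty, mutually exclusive and exhaustive."""
    events = list(events)
    if any(len(e) == 0 for e in events):
        return False
    union = frozenset().union(*events)
    exclusive = sum(len(e) for e in events) == len(union)
    return exclusive and union == omega


wet = frozenset({"rain", "drizzle"})
dry = frozenset({"dry"})

# Two hypothetical actions, each written as {c_j | E_j, j in J}
actions: Dict[str, Action] = {
    "take umbrella": {wet: "dry but encumbered", dry: "encumbered for nothing"},
    "leave umbrella": {wet: "soaked", dry: "comfortable"},
}


def consequence(action: Action, outcome: str) -> str:
    """Return the consequence attached to the unique event containing the outcome."""
    for event, c in action.items():
        if outcome in event:
            return c
    raise ValueError("outcome not covered by the partition")


print(consequence(actions["leave umbrella"], "drizzle"))
```

In this sketch exactly one event of each partition occurs, so `consequence` always finds a single match; nothing is said about which action is preferable, since preferences have not yet been introduced.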

An individual’s perception of the state of uncertainty resulting from the choice 
of any particular $a_i$ is very much dependent on the information currently available.
In particular, $\{E_j, j \in J\}$ forms a partition of the total set of relevant possibilities
as the individual decision-maker now perceives them to be. Further information, 
of a kind which leads to a restriction on what can be regarded as the total set of 
possibilities, will change the perception of the uncertainties, in that some of the 
$E_j$'s may become very implausible (or even logically impossible) in the light of
the new information, whereas others may become more plausible. It is therefore 
of considerable importance to bear in mind that a representation such as Figure 2.1 
only captures the structure of a decision problem as perceived at a particular point 
in time. Preferences about the uncertain scenarios resulting from the choices of 
actions depend on attitudes to the consequences involved and assessments of the 
uncertainties attached to the corresponding events. The latter are clearly subject to 
change as new information is acquired and this may well change overall preferences 
among the various courses of action. 

The notion of preference is, of course, very familiar in the everyday context 
of actual or potential choice. Indeed, an individual decision-maker often prefaces
an actual choice (from a menu, an investment portfolio, a range of possible forms
of medical treatment, a textbook of statistical methods, etc.) with the phrase “I
prefer ...” (caviar, equities, surgery, Bayesian procedures, etc.). To prefer action
$a_1$ to action $a_2$ means that if these were the only two options available $a_1$ would
be chosen (conditional, of course, on the information available at the time). In
everyday terms, the idea of indifference between two courses of action also has
a clear operational meaning. It signifies a willingness to accept an externally
determined choice (for example, letting a disinterested third party choose, or tossing
a coin).

In addition to representing the structure of a decision problem using the three 
elements discussed above, we must also be able to represent the idea of preference 
as applied to the comparison of some or all of the pairs of available options. We 
shall therefore need to consider a fourth basic element of a decision problem: 


(iv) the relation $\leq$, which expresses the individual decision-maker’s preferences
between pairs of available actions, so that $a_1 \leq a_2$ signifies that $a_1$ is not
preferred to $a_2$.


These four basic elements have been introduced in a rather informal manner. 
In order to study decision problems in a precise way, we shall need to reformulate 
these concepts in a more formal framework. The development which follows, here
and in Section 3.3, is largely based on Bernardo, Ferrandiz and Smith (1985). 


2.2.2 Formal Representation 


When considering a particular, concrete decision problem, we do not usually
confine our thoughts to only those outcomes and options explicitly required for the
specification of that problem. Typically, we expand our horizons to encompass
analogous problems, which we hope will aid us in ordering our thoughts by
providing suggestive points of reference or comparison. The collection of uncertain
scenarios defined by the original concrete problem is therefore implicitly embedded 
in a somewhat wider framework of actual and hypothetical scenarios. We begin by 
describing this wider frame of discourse within which the comparisons of scenarios 
are to be carried out. It is to be understood that the initial specification of any such 
particular frame of discourse, together with the preferences among options within 
it, are dependent on the decision-maker’s overall state of information at that time. 
Throughout, we shall denote this initial state of mind by $M_0$.

We now give a formal definition of a decision problem. This will be presented 
in a rather compact form; detailed elaboration is provided in the remarks following
the definition. 


Definition 2.1. (Decision problem). A decision problem is defined by the
elements $(\mathcal{E}, \mathcal{C}, \mathcal{A}, \leq)$, where:


(i) $\mathcal{E}$ is an algebra of relevant events, $E_j$;

(ii) $\mathcal{C}$ is a set of possible consequences, $c_j$;


(iii) $\mathcal{A}$ is a set of options, or potential acts, consisting of functions which map
finite partitions of $\Omega$, the certain event in $\mathcal{E}$, to compatibly-dimensioned,
ordered sets of elements of $\mathcal{C}$;


(iv) $\leq$ is a preference order, taking the form of a binary relation between
some of the elements of $\mathcal{A}$.


We now discuss each of these elements in detail. Within this wider frame of 
discourse, an individual decision-maker will wish to consider the uncertain events 
judged to be relevant in the light of the initial state of information $M_0$. However, it
is natural to assume that if $E_1 \in \mathcal{E}$ and $E_2 \in \mathcal{E}$ are judged to be relevant events then
it may also be of interest to know about their joint occurrence, or whether at least
one of them occurs. This means that $E_1 \cap E_2$ and $E_1 \cup E_2$ should also be assumed
to belong to $\mathcal{E}$. Repetition of this argument suggests that $\mathcal{E}$ should be closed under
the operations of arbitrary finite intersections and unions. Similarly, it is natural
to require $\mathcal{E}$ to be closed under complementation, so that $E^c \in \mathcal{E}$. In particular,
these requirements ensure that the certain event $\Omega$ and the impossible event $\emptyset$ both
belong to $\mathcal{E}$. Technically, we are assuming that the class of relevant events has the
structure of an algebra. (However, it can certainly be argued that this is too rigid 
an assumption. We shall provide further discussion of this and related issues in 
Section 2.8.4.) 
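
The closure requirements just described can be made concrete on a small finite example. The following Python sketch, offered only as an illustration under the assumption that events are represented as subsets of a finite certain event, repeatedly adds complements, unions and intersections until nothing new appears. The function name `generate_algebra` and the four-point example are hypothetical and are not taken from the text.

```python
from itertools import combinations


def generate_algebra(omega, events):
    """Smallest class containing the given events that is closed under
    complementation, finite unions and finite intersections (finite omega only)."""
    omega = frozenset(omega)
    algebra = {frozenset(), omega} | {frozenset(e) for e in events}
    changed = True
    while changed:
        changed = False
        current = list(algebra)
        # close under complementation
        for e in current:
            complement = omega - e
            if complement not in algebra:
                algebra.add(complement)
                changed = True
        # close under pairwise unions and intersections
        for e1, e2 in combinations(current, 2):
            for new in (e1 | e2, e1 & e2):
                if new not in algebra:
                    algebra.add(new)
                    changed = True
    return algebra


# Hypothetical example: two overlapping relevant events on a four-point certain event
omega = {1, 2, 3, 4}
algebra = generate_algebra(omega, [{1, 2}, {2, 3}])
print(sorted(sorted(e) for e in algebra))
```

Here the two overlapping events separate all four points, so the generated algebra is the full collection of subsets; starting from a single event would give only that event, its complement, the certain event and the impossible event.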

As we mentioned when introducing the idea of a wider frame of discourse, 
the algebra $\mathcal{E}$ will consist of what we might call the real-world events (that is,
those occurring in the structure of any concrete, actual decision problem that we
may wish to consider), together with any other hypothetical events, which it may
be convenient to bring to mind as an aid to thought. The class $\mathcal{E}$ will simply be
referred to as the algebra of (relevant) events. 

We denote by $\mathcal{C}$ the set of all consequences that the decision-maker wishes to
take into account; preferences among such consequences will later be assumed to
be independent of the state of information concerning relevant events. The class $\mathcal{C}$
will simply be referred to as the set of (possible) consequences.

In our introductory discussion we used the term action to refer to each po- 
tential act available as a choice at a decision node. Within the wider frame of 
discourse, we prefer the term option, since the general, formal framework may in- 
clude hypothetical scenarios (possibly rather far removed from potential concrete 
actions). 

So far as the definition of an option as a function is concerned, we note that 
this is a rather natural way to view options from a mathematical point of view: an 
option consists precisely of the linking of a partition of $\Omega$, $\{E_j, j \in J\}$, with a
corresponding set of consequences, $\{c_j, j \in J\}$. To represent such a mapping
we shall adopt the notation $\{c_j \mid E_j, j \in J\}$, with the interpretation that event $E_j$
leads to consequence $c_j$, $j \in J$.

It follows immediately from the definition of an option that the ordering of the
labels within $J$ is irrelevant, so that, for example, the options $\{c_1 \mid E, c_2 \mid E^c\}$ and
$\{c_2 \mid E^c, c_1 \mid E\}$ are identical, and forms such as $\{c \mid E_1, c \mid E_2, c_j \mid E_j, j \in J\}$ and
$\{c \mid E_1 \cup E_2, c_j \mid E_j, j \in J\}$ are completely equivalent. Which form is used in any
particular context is purely a matter of convenience. Sometimes, the interpretation
of an option with a rather cumbersome description is clarified by an appropriate
reformulation. For example, $a = \{c_1 \mid E \cap G, c_2 \mid E^c \cap G, c_3 \mid G^c\}$ may be more
compactly written as $a = \{a_1 \mid G, c_3 \mid G^c\}$, with $a_1 = \{c_1 \mid E, c_2 \mid E^c\}$. Thus, if


$$a = \{c_{kj} \mid E_{kj} \cap F_j, \; k \in K_j, \; j \in J\}, \qquad a_j = \{c_{kj} \mid E_{kj}, \; k \in K_j\}, \quad j \in J,$$


we shall use the composite function notation $a = \{a_j \mid F_j, j \in J\}$. In all cases,
the ordering of the labels is irrelevant. The class $\mathcal{A}$ of options, or potential actions,
will simply be referred to as the action space.
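
The equivalences and the composite notation described above can be checked mechanically on a toy example. The sketch below is purely illustrative: it assumes that an option is stored as a Python dictionary from events (subsets of a finite certain event) to consequences, and the helper names `canonical` and `flatten`, together with the six-point example, are hypothetical. Empty intersections are simply dropped, which is a simplification made here and not a statement from the text.

```python
def canonical(option):
    """Merge events carrying the same consequence: {c|E1, c|E2, ...} -> {c|E1 u E2, ...}."""
    merged = {}
    for event, c in option.items():
        merged[c] = merged.get(c, frozenset()) | event
    return {event: c for c, event in merged.items()}


def flatten(composite):
    """Expand a composite option {a_j | F_j}: each value is either a plain
    consequence or a nested option (a dict) to be restricted to F_j."""
    flat = {}
    for f_event, value in composite.items():
        if isinstance(value, dict):
            for e_event, c in value.items():
                restricted = e_event & f_event
                if restricted:              # drop empty intersections (simplification)
                    flat[restricted] = c
        else:
            flat[f_event] = value
    return flat


# Hypothetical six-point certain event with E, its complement, G and its complement
E, Ec = frozenset({1, 2, 3}), frozenset({4, 5, 6})
G, Gc = frozenset({1, 2, 4, 5}), frozenset({3, 6})

a1 = {E: "c1", Ec: "c2"}                    # a1 = {c1 | E, c2 | E^c}
a = {G: a1, Gc: "c3"}                       # a  = {a1 | G, c3 | G^c}
explicit = {E & G: "c1", Ec & G: "c2", Gc: "c3"}

split = {frozenset({1}): "c", frozenset({2}): "c", frozenset({3}): "d"}
merged = {frozenset({1, 2}): "c", frozenset({3}): "d"}

print(flatten(a) == explicit, canonical(split) == canonical(merged))
```

Because dictionaries compare equal regardless of insertion order, the irrelevance of the ordering of the labels comes for free in this representation.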

In defining options, the assumption of a finite partition into events of $\mathcal{E}$ seems
to us to correspond most closely to the structure of practical problems. However,
an extension to admit the possibility of infinite partitions has certain mathematical
advantages and will be fully discussed, together with other mathematical extensions, 
in Chapter 3. 

In introducing the preference binary relation $\leq$, we are not assuming that all
pairs of options $(a_1, a_2) \in \mathcal{A} \times \mathcal{A}$ can necessarily be related by $\leq$. If the relation
can be applied, in the sense that either $a_1 \leq a_2$ or $a_2 \leq a_1$ (or both), we say that
$a_1$ is not preferred to $a_2$, or $a_2$ is not preferred to $a_1$ (or both). From $\leq$, we can
derive a number of other useful binary relations.


Definition 2.2. (Induced binary relations).

(i) $a_1 \sim a_2 \iff a_1 \leq a_2$ and $a_2 \leq a_1$.

(ii) $a_1 < a_2 \iff a_1 \leq a_2$ and it is not true that $a_2 \leq a_1$.

(iii) $a_1 \geq a_2 \iff a_2 \leq a_1$.

(iv) $a_1 > a_2 \iff a_2 < a_1$.


Definition 2.2 is to be understood as referring to any options $a_1$, $a_2$ in $\mathcal{A}$. To
simplify the presentation we shall omit such universal quantifiers when there is no
danger of confusion. The induced binary relations are to be interpreted to mean
that $a_1$ is equivalent to $a_2$ if and only if $a_1 \sim a_2$, and $a_1$ is strictly preferred to $a_2$ if
and only if $a_1 > a_2$. Together with the interpretation of $\leq$, these suffice to describe
all cases where pairs of options can be compared.
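
As a small illustrative sketch, the four induced relations can be computed directly from a stated "not preferred to" relation. The representation of the relation as a set of ordered pairs, the function name `induced_relations` and the beverage example are assumptions made here for illustration; note that the option "water" is deliberately left unrelated to the others, since not all pairs need be comparable.

```python
def induced_relations(leq):
    """Given leq, a set of pairs (a1, a2) read as 'a1 is not preferred to a2',
    return the induced relations of equivalence, strict and reversed preference."""
    equiv = {(a, b) for (a, b) in leq if (b, a) in leq}           # a ~ b
    strictly_less = {(a, b) for (a, b) in leq if (b, a) not in leq}  # a < b
    geq = {(b, a) for (a, b) in leq}                              # b >= a
    strictly_greater = {(b, a) for (a, b) in strictly_less}       # b > a
    return equiv, strictly_less, geq, strictly_greater


# Hypothetical preferences over options; "water" is not compared with anything
options = {"tea", "coffee", "wine", "water"}
leq = {
    ("tea", "coffee"), ("coffee", "tea"),   # tea and coffee are equivalent
    ("tea", "wine"), ("coffee", "wine"),    # wine is strictly preferred to both
}

equiv, strictly_less, geq, strictly_greater = induced_relations(leq)
print(sorted(strictly_greater))
```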

We can identify individual consequences as special cases of options by writing 
$c = \{c \mid \Omega\}$, for any $c \in \mathcal{C}$. Without introducing further notation, we shall simply
regard $c$ as denoting either an element of $\mathcal{C}$, or the element $\{c \mid \Omega\}$ of $\mathcal{A}$. There
will be no danger of any confusion arising from this identification. Thus, we shall
write $c_1 \leq c_2$ if and only if $\{c_1 \mid \Omega\} \leq \{c_2 \mid \Omega\}$ and say that consequence $c_1$ is
not preferred to consequence $c_2$. Strictly speaking, we should introduce a new
symbol to replace $\leq$ when referring to a preference relation over $\mathcal{C} \times \mathcal{C}$, since $\leq$
is defined over $\mathcal{A} \times \mathcal{A}$. In fact, this parsimonious abuse of notation creates no
danger of confusion and we shall routinely adopt such usage in order to avoid a
proliferation of symbols. We shall proceed similarly with the binary relations $\sim$
and $<$ introduced in Definition 2.2. To avoid triviality, we shall later formally
assume that there exist at least two consequences $c_1$ and $c_2$ such that $c_1 < c_2$.

The basic preference relation between options, $\leq$, conditional on the initial
state of information $M_0$, can also be used to define a binary relation on $\mathcal{E} \times \mathcal{E}$,
the collection of all pairs of relevant events. This binary relation will capture the
intuitive notion of one event being “more likely” than another. Since, once again,
there is no danger of confusion, we shall further economise on notation and also
use the symbol $\leq$ to denote this new uncertainty binary relation between events.


Definition 2.3. (Uncertainty relation).

$$E \leq F \iff \text{for all } c_1 < c_2, \quad \{c_2 \mid E, c_1 \mid E^c\} \leq \{c_2 \mid F, c_1 \mid F^c\};$$

we then say that $E$ is not more likely than $F$.


The intuitive content of the definition is clear. If we compare two dichotomised 
options, involving the same pair of consequences and differing only in terms of their 
uncertain events, we will prefer the option under which we feel it is “more likely” 
that the preferred consequence will obtain. Clearly, the force of this argument
applies independently of the choice of the particular consequences $c_1$ and $c_2$, provided
that our preferences between the latter are assumed independent of any
considerations regarding the events $E$ and $F$.

Continuing the (convenient and harmless) abuse of notation, we shall also use
the derived binary relations given in Definition 2.2 to describe uncertainty relations
between events. Thus, $E \sim F$ if and only if $E$ and $F$ are equally likely, and $E > F$
if and only if $E$ is strictly more likely than $F$. Since, for all $c_1 < c_2$,


$$c_1 \equiv \{c_2 \mid \emptyset, c_1 \mid \Omega\} < \{c_2 \mid \Omega, c_1 \mid \emptyset\} \equiv c_2,$$


it is always true, as one would expect, that $\emptyset < \Omega$.

It is worth stressing once again at this point that all the order relations over
$\mathcal{A} \times \mathcal{A}$, and hence over $\mathcal{C} \times \mathcal{C}$ and $\mathcal{E} \times \mathcal{E}$, are to be understood as personal, in
the sense that, given an agreed structure for a decision problem, each individual
is free to express his or her own personal preferences, in the light of his or her
initial state of information $M_0$. Thus, for a given individual, a statement such
as $E > F$ is to be interpreted as “this individual, given the state of information
described by $M_0$, considers event $E$ to be more likely than event $F$”. Moreover,
Definition 2.3 provides such a statement with an operational meaning since, for all
$c_1 < c_2$, $E > F$ is equivalent to an agreement to choose option $\{c_2 \mid E, c_1 \mid E^c\}$ in
preference to option $\{c_2 \mid F, c_1 \mid F^c\}$.

To complete our discussion of basic ideas and definitions, we need to consider
one further important topic. Throughout this section, we have stressed that
preferences, initially defined among options but inducing binary relations among
consequences and events, are conditional on the current state of information. The
initial state of information, taking as an arbitrary “origin” the first occasion on
which an individual thinks systematically about the problem, has been denoted by
$M_0$. Subsequently, however, we shall need to take into account further information,
obtained by considering the occurrence of real-world events. Given the assumed
occurrence of a possible event $G$, preferences between options will be described
by a new binary relation $\leq_G$, taking into account both the initial information $M_0$
and the additional information provided by $G$. The obvious relation between $\leq$
and $\leq_G$ is given by the following:


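Definition 2.4. (Conditional preference). For any $G > \emptyset$,


(i) $a_1 \leq_G a_2 \iff$ for all $a$, $\{a_1 \mid G, a \mid G^c\} \leq \{a_2 \mid G, a \mid G^c\}$;
(ii) $E \leq_G F \iff$ for $c_1 <_G c_2$, $\{c_2 \mid E, c_1 \mid E^c\} \leq_G \{c_2 \mid F, c_1 \mid F^c\}$.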


The intuitive content of the definition is clear. If we do not prefer $a_1$ to $a_2$,
given $G$, then this preference obviously carries over to any pair of options leading,
respectively, to $a_1$ or $a_2$ if $G$ occurs, and defined identically if $G^c$ occurs.
Conversely, comparison of options which are identical if $G^c$ occurs depends entirely on
consideration of what happens if $G$ occurs. Naturally, the induced binary relations
set out in Definition 2.2 have their obvious counterparts, denoted by $\sim_G$ and $<_G$.

The induced binary relation between consequences is obviously defined by


$$c_1 \leq_G c_2 \iff \{c_1 \mid \Omega\} \leq_G \{c_2 \mid \Omega\}.$$


However, when we come, in Section 2.3, to discuss the desirable properties of $\leq$
and $\leq_G$ we shall make formal assumptions which imply that, as one would expect,
$c_1 \leq_G c_2$ if and only if $c_1 \leq c_2$, so that preferences between pure consequences are
not affected by additional information regarding the uncertain events in $\mathcal{E}$.

The definition of the conditional uncertainty relation $\leq_G$ is a simple translation
of Definition 2.3 to a conditional preference setting. The conditional uncertainty
relation $\leq_G$ induced between events is of fundamental importance. This relation,
with its derived forms $\sim_G$ and $<_G$, provides the key to investigating the way in
which uncertainties about events should be modified in the light of new
information. Obviously, if $G = \Omega$, all conditional relations reduce to their unconditional
counterparts. Thus, it is only when $\emptyset < G < \Omega$ that conditioning on $G$ may yield
new preference patterns.
2.3 COHERENCE AND QUANTIFICATION
2.3.1 Events, Options and Preferences 


The formal representation of the decision-maker’s “wider frame of discourse”
includes an algebra of events $\mathcal{E}$, a set of consequences $\mathcal{C}$, and a set of options $\mathcal{A}$,
whose generic element has the form $\{c_j \mid E_j, j \in J\}$, where $\{E_j, j \in J\}$ is a finite
partition of the certain event $\Omega$, $E_j \in \mathcal{E}$, $c_j \in \mathcal{C}$, $j \in J$. The set $\mathcal{A} \times \mathcal{A}$ is equipped
with a collection of binary relations $\leq_G$, $G > \emptyset$, representing the notion that one
option is not preferred to another, given the assumed occurrence of a possible event
$G$. In addition, all preferences are assumed conditional on an initial state of
information, $M_0$, with the binary relation $\leq$ (i.e., $\leq_\Omega$) representing the preference
relation on $\mathcal{A} \times \mathcal{A}$ conditional on $M_0$ alone.


We now wish to make precise our assumptions about these elements of the 
formal representation of a decision problem. Bearing in mind the overall objective 
of developing a rational approach to choosing among options, our assumptions, 
presented in the form of a series of axioms, can be viewed as responses to the 
questions: “what rules should preference relations obey?” and “what events should
be included in $\mathcal{E}$?”


Each formal axiom will be accompanied by a detailed discussion of the intu- 
itive motivation underlying it. 


It is important to recognise that the axioms we shall present are prescriptive, not 
descriptive. Thus, they do not purport to describe the ways in which individuals 
actually do behave in formulating problems or making choices, neither do they 
assert, on some presumed “ethical” basis, the ways in which individuals should 
behave. The axioms simply prescribe constraints which it seems to us imperative 
to acknowledge in those situations where an individual aspires to choose among 
alternatives in such a way as to avoid certain forms of behavioural inconsistency. 


2.3.2 Coherent Preferences 


We shall begin by assuming that problems represented within the formal framework 
are non-trivial and that we are able to compare any pair of simple dichotomised 
options. 


Axiom 1. (Comparability of consequences and dichotomised options).
(i) There exist consequences $c_1$, $c_2$ such that $c_1 < c_2$.
(ii) For all consequences $c_1$, $c_2$, and events $E$, $F$,
either $\{c_2 \mid E, c_1 \mid E^c\} \leq \{c_2 \mid F, c_1 \mid F^c\}$
or $\{c_2 \mid E, c_1 \mid E^c\} \geq \{c_2 \mid F, c_1 \mid F^c\}$.
Discussion of Axiom 1. Condition (i) is very natural. If all consequences were
equivalent, there would not be a decision problem in any real sense, since all choices
would certainly lead to precisely equivalent outcomes. We have already noted that,
in any given decision problem, $\mathcal{C}$ can be defined as simply the set of consequences
required for that problem. Condition (ii) does not therefore assert that we should
be able to compare any pair of conceivable options, however bizarre or fantastic.
In most practical problems, there will typically be a high degree of similarity in the
form of the consequences (e.g. all monetary), although it is easy to think of examples
where this form is complex (e.g. combinations of monetary, health and industrial
relations elements). We are trying to capture the essence of what is required for
an orderly and systematic approach to comparing alternatives of genuine interest.
We are not, at this stage, making the direct assumption that all options, however
complex, can be compared. But there could be no possibility of an orderly and
systematic approach if we were unwilling to express preferences among simple
dichotomised options and hence (with $E = F = \Omega$) among the consequences
themselves. Condition (ii) is therefore to be interpreted in the following sense: “If
we aspire to make a rational choice between alternative options, then we must at
least be willing to express preferences between simple dichotomised options.”


There are certainly many situations where we find the task of comparing simple
options, and even consequences, very difficult. Resource allocation among
competing health care programmes involving different target populations and
morbidity and mortality rates is one obvious such example. However, the
difficulty of comparing options in such cases does not, of course, obviate the need
for such comparisons if we are to aspire to responsible decision making.


We shall now state our assumptions about the ways in which preferences
should fit together or cohere in terms of the order relation over $\mathcal{A} \times \mathcal{A}$.


Axiom 2. (Transitivity of preferences).
(i) $a \leq a$.
(ii) If $a_1 \leq a_2$ and $a_2 \leq a_3$, then $a_1 \leq a_3$.


Discussion of Axiom 2. Condition (i) has obvious intuitive support. It would
make little sense to assert that an option was strictly preferred to itself. It would
also seem strangely perverse to claim to be unable to compare an option with
itself! We note that, from Definition 2.2 (i), if $a \leq a$, then $a \sim a$. Condition (ii)
requires preferences to be transitive. The intuitive basis for such a requirement is
perhaps best illustrated by considering the consequences of intransitive preferences.
Suppose, therefore, that we found ourselves expressing the preferences $a_1 < a_2$,
$a_2 < a_3$ and $a_3 < a_1$ among three options $a_1$, $a_2$ and $a_3$. The assertion of
strict preference rules out equivalence between any pair of the options, so that our
expressed preferences reveal that we perceive some actual difference in value (no
matter how small) between the two options in each case. Let us now examine
the behavioural implications of these expressed preferences. If we consider, for
example, the preference $a_1 < a_2$, we are implicitly stating that there exists a “price”,
say $x$, that we would be willing to pay in order to move from a position of having
to accept option $a_1$ to one where we have, instead, to accept option $a_2$. Let $y$ and
$z$ denote the corresponding “prices” for switching from $a_2$ to $a_3$ and from $a_3$ to
$a_1$, respectively. Suppose now that we are confronted with the prospect of having
to accept option $a_1$. By virtue of the expressed preference $a_1 < a_2$ and the above
discussion, we are willing to pay $x$ in order to exchange option $a_1$ for option $a_2$.
But now, by virtue of the preference $a_2 < a_3$, we are willing to pay $y$ in order
to exchange $a_2$ for $a_3$. Repeating the argument once again, since $a_3 < a_1$ we
are willing to pay $z$ in order to avoid $a_3$ and have, instead, the prospect of option
$a_1$. We would thus have paid $x + y + z$ in order to find ourselves in precisely
the same position as we started from! What is more, we could find ourselves
arguing through this cycle over and over again. Willingness to act on the basis
of intransitive preferences is thus seen to be equivalent to a willingness to suffer
unnecessarily the certain loss of something to which one attaches positive value.
We regard this as inherently inconsistent behaviour and recall that the purpose of
the axioms is to impose rules of coherence on preference orderings that will exclude
the possibility of such inconsistencies. Thus, Axiom 2(ii) is to be understood in the
following sense: “If we aspire to avoid expressing preferences whose behavioural
implications are such as to lead us to the certain loss of something we value, then
we must ensure that our preferences fit together in a transitive manner.”
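
The cycle just described can be traced numerically. The following is only a minimal sketch: the option labels, the particular prices standing in for $x$, $y$ and $z$, and the function name `run_money_pump` are hypothetical choices made for illustration. It shows that acting on the intransitive preferences $a_1 < a_2$, $a_2 < a_3$, $a_3 < a_1$ returns us to the option we started with, having paid $x + y + z$ for each complete cycle.

```python
from fractions import Fraction

# Hypothetical cycle of strict preferences: a1 < a2, a2 < a3, a3 < a1.
# Each entry maps the option currently held to the option preferred to it,
# together with the (illustrative) price we would pay to make the switch.
SWITCHES = {
    "a1": ("a2", Fraction(1, 100)),  # price x for moving from a1 to a2
    "a2": ("a3", Fraction(2, 100)),  # price y for moving from a2 to a3
    "a3": ("a1", Fraction(3, 100)),  # price z for moving from a3 to a1
}


def run_money_pump(start, n_switches):
    """Act on the stated preferences n_switches times, paying at each switch.

    Returns the option finally held and the total amount paid.
    """
    held, paid = start, Fraction(0)
    for _ in range(n_switches):
        held, price = SWITCHES[held]
        paid += price
    return held, paid


# One complete cycle: three switches starting from a1.
final_option, total_paid = run_money_pump("a1", 3)
print(final_option, total_paid)
```

Exact fractions are used so that the accumulated loss is not blurred by rounding; any positive prices would make the same point.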


Our discussion of this axiom is, of course, informal and appeals to directly
intuitive considerations. At this stage, it would therefore be inappropriate to
become involved in a formal discussion of terms such as “value” and “price”. It 
is intuitively clear that if we assert strict preference there must be some amount 
of money (or grains of wheat, or beads, or whatever), however small, having a 
“value” less than the perceived difference in “value” between the two options. We 
should therefore be willing to pay this amount to switch from the less preferred 
to the more preferred option. 


The following consequences of Axiom 2 are easily established and will prove 
useful in our subsequent development. 


Proposition 2.1. (Transitivity of uncertainties).
(i) $E \sim E$.
(ii) $E_1 \leq E_2$ and $E_2 \leq E_3$ imply $E_1 \leq E_3$.


Proof. This is immediate from Definition 2.3 and Axiom 2. ◁
Proposition 2.2. (Derived transitive properties).
(i) If $a_1 \sim a_2$ and $a_2 \sim a_3$ then $a_1 \sim a_3$.
If $E_1 \sim E_2$ and $E_2 \sim E_3$ then $E_1 \sim E_3$.

(ii) If $a_1 < a_2$ and $a_2 \sim a_3$ then $a_1 < a_3$.

If $E_1 < E_2$ and $E_2 \sim E_3$ then $E_1 < E_3$.

Proof. To prove (i), let $a_1 \sim a_2$ and $a_2 \sim a_3$ so that, by Definition 2.2, $a_1 \leq a_2$,
$a_2 \leq a_1$ and $a_2 \leq a_3$, $a_3 \leq a_2$. Then, by Axiom 2(ii), $a_1 \leq a_3$ and $a_3 \leq a_1$, and
thus $a_1 \sim a_3$. A similar argument applies to events using Proposition 2.1. Again,
part (ii) follows rather similarly.
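
For a finite set of options, Axiom 2 and the derived properties of Proposition 2.2 can be checked by brute force. The sketch below is illustrative only: the options, the ranking used to generate a coherent relation, and all function names are hypothetical, and the reading of Definition 2.2 used here ($a \sim b$ when each is not preferred to the other; $a < b$ when $a \leq b$ holds but $b \leq a$ does not) is an assumption, since that definition lies outside this section.

```python
from itertools import product


def equivalent(leq, a, b):
    """a ~ b: each option is not preferred to the other."""
    return (a, b) in leq and (b, a) in leq


def strictly_less(leq, a, b):
    """a < b: a is not preferred to b, and the reverse does not hold."""
    return (a, b) in leq and (b, a) not in leq


def satisfies_axiom_2(leq, options):
    """Check Axiom 2: (i) a <= a for every option, (ii) transitivity."""
    reflexive = all((a, a) in leq for a in options)
    transitive = all(
        (a, c) in leq
        for a, b, c in product(options, repeat=3)
        if (a, b) in leq and (b, c) in leq
    )
    return reflexive and transitive


def satisfies_proposition_2_2(leq, options):
    """Check both parts of Proposition 2.2 for options."""
    triples = list(product(options, repeat=3))
    part_i = all(
        equivalent(leq, a, c)
        for a, b, c in triples
        if equivalent(leq, a, b) and equivalent(leq, b, c)
    )
    part_ii = all(
        strictly_less(leq, a, c)
        for a, b, c in triples
        if strictly_less(leq, a, b) and equivalent(leq, b, c)
    )
    return part_i and part_ii


OPTIONS = ["a1", "a2", "a3", "a4"]

# A coherent relation generated from a hypothetical ranking with ties.
RANK = {"a1": 0, "a2": 0, "a3": 1, "a4": 1}
COHERENT = {(a, b) for a, b in product(OPTIONS, repeat=2) if RANK[a] <= RANK[b]}

# The cyclic preferences a1 < a2, a2 < a3, a3 < a1 discussed under Axiom 2.
CYCLE_OPTIONS = OPTIONS[:3]
CYCLIC = {(a, a) for a in CYCLE_OPTIONS} | {("a1", "a2"), ("a2", "a3"), ("a3", "a1")}

print(satisfies_axiom_2(COHERENT, OPTIONS), satisfies_proposition_2_2(COHERENT, OPTIONS))
print(satisfies_axiom_2(CYCLIC, CYCLE_OPTIONS))
```

The coherent relation passes both checks, whereas the cyclic one fails transitivity, as the discussion of Axiom 2 leads one to expect.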


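Axiom 3. (Consistency of preferences).
(i) If $c_1 \leq c_2$ then, for all $G > \emptyset$, $c_1 \leq_G c_2$.
(ii) If, for some $c_1 < c_2$, $\{c_2 \mid E, c_1 \mid E^c\} \leq \{c_2 \mid F, c_1 \mid F^c\}$, then $E \leq F$.
(iii) If, for some $c$ and $G > \emptyset$, $\{a_1 \mid G, c \mid G^c\} \leq \{a_2 \mid G, c \mid G^c\}$,
then $a_1 \leq_G a_2$.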


Discussion of Axiom 3. Condition (i) formalises the idea that preferences
between pure consequences should not be affected by the acquisition of further
information regarding the uncertain events in $\mathcal{E}$. Conditions (ii) and (iii) ensure
that Definitions 2.3 and 2.4 have operational content. Indeed, (ii) asserts that if
we have $\{c_2 \mid E, c_1 \mid E^c\} \leq \{c_2 \mid F, c_1 \mid F^c\}$ for some $c_1 < c_2$ then we should
have this preference for any $c_1 < c_2$. This formalises the intuitive idea that the
stated preference should only depend on the “relative likelihood” of $E$ and $F$
and should not depend on the particular consequences used in constructing the
options. Similarly, (iii) asserts that if we have the preference $\{a_1 \mid G, c \mid G^c\} \leq
\{a_2 \mid G, c \mid G^c\}$ for some $c$ then, given $G$, $a_1$ should not be preferred to $a_2$, so that,
for any $a$, $\{a_1 \mid G, a \mid G^c\} \leq \{a_2 \mid G, a \mid G^c\}$. This latter argument is a version
of what might be called the sure-thing principle: if two situations are such that
whatever the outcome of the first there is a preferable corresponding outcome of
the second, then the second situation is preferable overall.

An important implication of Axiom 3 is that preferences between consequences
are invariant under changes in the information “origin” regarding events
in $\mathcal{E}$.


Proposition 2.3. (Invariance of preferences between consequences).
$c_1 \leq c_2$ if and only if there exists $G > \emptyset$ such that $c_1 \leq_G c_2$.


Proof. If $c_1 \leq c_2$ then, by Axiom 3(i), $c_1 \leq_G c_2$ for any event $G$. Conversely,
by Definition 2.4(i), for any $G > \emptyset$, $c_1 \leq_G c_2$ implies that for any option $a$, one
has $\{c_1 \mid G, a \mid G^c\} \leq \{c_2 \mid G, a \mid G^c\}$. Taking $a = \{c_1 \mid G, c_2 \mid G^c\}$, this implies
that $\{c_1 \mid G, c_2 \mid G^c\} \leq \{c_1 \mid \emptyset, c_2 \mid \Omega\}$. If $c_1 > c_2$ this implies, by Axiom 3(ii), that
$G \leq \emptyset$, thus contradicting $G > \emptyset$. Hence, by Axiom 1(ii), $c_1 \leq c_2$. ◁
Another important consequence of Axiom 3 is that uncertainty orderings of
events respect logical implications, in the sense that if $E$ logically implies $F$, i.e.,
if $E \subseteq F$, then $F$ cannot be considered less likely than $E$.


Proposition 2.4. (Monotonicity). If $E \subseteq F$ then $E \leq F$.
Proof. For any $c_1 < c_2$, define


$$a_1 \equiv \{c_2 \mid E, c_1 \mid E^c\} \equiv \{c_1 \mid F - E, \{c_2 \mid E, c_1 \mid E^c\} \mid (F - E)^c\},$$
$$a_2 \equiv \{c_2 \mid F, c_1 \mid F^c\} \equiv \{c_2 \mid F - E, \{c_2 \mid E, c_1 \mid E^c\} \mid (F - E)^c\}.$$


By Axiom 3(i) with $G = F - E = F \cap E^c$, $a_1 \leq a_2$. It now follows immediately
from Definition 2.2 that $E \leq F$. ◁


This last result is an example of how coherent qualitative comparisons of 
uncertain events in terms of the “not more likely” relation conform to intuitive 
requirements. 

It follows from Proposition 2.4 that, as one would expect, for any event $E$,
$\emptyset \leq E \leq \Omega$. We shall mostly work, however, with “significant” events, for which
this ordering is strict.


Definition 2.5. (Significant events). An event $E$ is significant given $G > \emptyset$
if $c_1 <_G c_2$ implies that $c_1 <_G \{c_2 \mid E, c_1 \mid E^c\} <_G c_2$. If $G = \Omega$, we shall
simply say that $E$ is significant.


Intuitively, significant events given $G$ are those operationally perceived by
the decision-maker as “practically possible but not certain” given the information
provided by $G$. Thus, given $G > \emptyset$ and assuming $c_1 <_G c_2$, if $E$ is judged to be
significant given $G$, one would strictly prefer the option $\{c_2 \mid E, c_1 \mid E^c\}$ to $c_1$ for
sure, since it provides an additional perceived possibility of obtaining the more
desirable consequence $c_2$. Similarly, one would strictly prefer $c_2$ for sure to the
stated option.


Proposition 2.5. (Characterisation of significant events). An event $E$ is
significant given $G > \emptyset$ if and only if $\emptyset < E \cap G < G$. In particular, $E$ is
significant if and only if $\emptyset < E < \Omega$.


Proof. Using Definitions 2.4 and 2.5, if $E$ is significant given $G$ then, for all
$c_1 <_G c_2$ and for any option $a$,


$$\{c_1 \mid G, a \mid G^c\} < \{c_2 \mid E \cap G, c_1 \mid E^c \cap G, a \mid G^c\} < \{c_2 \mid G, a \mid G^c\}.$$
Taking $a = c_1$, we have


$$c_1 \equiv \{c_2 \mid \emptyset, c_1 \mid \Omega\} < \{c_2 \mid E \cap G, c_1 \mid (E \cap G)^c\} < \{c_2 \mid G, c_1 \mid G^c\}$$
and hence, by Definition 2.3, $\emptyset < E \cap G < G$. Conversely, if $\emptyset < E \cap G < G$,
$$\{c_1 \mid G, c_1 \mid G^c\} < \{c_2 \mid E \cap G, c_1 \mid E^c \cap G, c_1 \mid G^c\} < \{c_2 \mid G, c_1 \mid G^c\}$$


and hence, by Axiom 3(iii), $c_1 <_G \{c_2 \mid E, c_1 \mid E^c\} <_G c_2$. If, in particular, $G = \Omega$
then $E$ is significant if and only if $\emptyset < E < \Omega$. ◁


The operational essence of “learning from experience” is that a decision- 
maker's preferences may change in passing from one state of information to a 
new state brought about by the acquisition of further information regarding the 
occurrence of events in $\mathcal{E}$, which leads to changes in assessments of uncertainty.
There are, however, too many complex ways in which such changes in assessments 
can take place for us to be able to capture the idea in a simple form. On the other 
hand, the very special case in which preferences do not change is easy to describe 
in terms of the concepts thus far available to us. 


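Definition 2.6. (Pairwise independence of events).
We say that $E$ and $F$ are (pairwise) independent, denoted by $E \perp F$, if, and
only if, for all $c$, $c_1$, $c_2$,

(i) $c \bullet \{c_2 \mid E, c_1 \mid E^c\} \Rightarrow c \bullet_F \{c_2 \mid E, c_1 \mid E^c\}$,

(ii) $c \bullet \{c_2 \mid F, c_1 \mid F^c\} \Rightarrow c \bullet_E \{c_2 \mid F, c_1 \mid F^c\}$,


where $\bullet$ is any one of the relations $<$, $\sim$ or $>$.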


The definition is given for the simple situation of preferences between pure 
consequences and dichotomised options. Since by Proposition 2.3 preferences re- 
garding pure consequences are unaffected by additional information, the condition 
stated captures, in an operational form, the notion that uncertainty judgements about
$E$, say, are unaffected by the additional information $F$. We interpret $E \perp F$ as
“$E$ is independent of $F$”. An alternative characterisation will be given in
Proposition 2.13.


2.3.3 Quantification 


The notion of preference between options, formalised by the binary relation $\leq$,
provides a qualitative basis for comparing options and, by extension, for comparing
consequences and events. The coherence axioms (Axioms 1 to 3) then provide a
minimal set of rules to ensure that qualitative comparisons based on $\leq$ cannot have
intuitively undesirable implications.

We shall now argue that this purely qualitative framework is inadequate for
serious, systematic comparisons of options. An illuminating analogy can be drawn
between $\leq$ and a number of qualitative relations in common use both in an everyday
setting and in the physical sciences.



Consider, for example, the relations not heavier than, not longer than, not 
hotter than. It is abundantly clear that these cannot suffice, as they stand, as an 
adequate basis for the physical sciences. Instead, we need to introduce in each case 
some form of quantification by setting up a standard unit of measurement, such
as the kilogram, the metre, or the centigrade interval, together with an (implicitly) 
continuous scale such as arbitrary decimal fractions of a kilogram, a metre, a 
centigrade interval. This enables us to assign a numerical value, representing 
weight, length, or temperature, to any given physical or chemical entity. 

This can be achieved by carrying out, implicitly or explicitly, a series of 
qualitative pairwise comparisons of the feature of interest with appropriately chosen 
points on the standard scale. For example, in quantifying the length of a stick, we 
place one end against the origin of a metre scale and then use a series of qualitative 
comparisons, based on “not longer than” (and derived relations, such as “strictly 
longer than”). If the stick is “not longer than” the scale mark of 2.5 metres, but is
“strictly longer than” the scale mark of 2.4 metres, we might lazily report that the 
stick is “2.45 metres long”. If we needed to, we could continue to make qualitative 
comparisons of this kind with finer subdivisions of the scale, thus extending the 
number of decimal places in our answer. The example is, of course, a trivial one, 
but the general point is extremely important. Precision, through quantification, is 
achieved by introducing some form of numerical standard into a context already 
equipped with a coherent qualitative ordering relation. 

We shall regard it as essential to be able to aspire to some kind of quantitative 
precision in the context of comparing options. It is therefore necessary that we 
have available some form of standard options, whose definitions have close links 
with an easily understood numerical scale, and which will play a role analogous to 
the standard metre or standard kilogram. As a first step towards this, we make the 
following assumption about the algebra of events, €. 


Axiom 4. (Existence of standard events). There exists a subalgebra $\mathcal{S}$ of $\mathcal{E}$
and a function $\mu : \mathcal{S} \to [0, 1]$ such that:
(i) $S_1 \leq S_2$ if, and only if, $\mu(S_1) \leq \mu(S_2)$;
(ii) $S_1 \cap S_2 = \emptyset$ implies that $\mu(S_1 \cup S_2) = \mu(S_1) + \mu(S_2)$;
(iii) for any number $\alpha$ in $[0, 1]$, and events $E$, $F$, there is a standard event $S$
such that $\mu(S) = \alpha$, $E \perp S$ and $F \perp S$;
(iv) $S_1 \perp S_2$ implies that $\mu(S_1 \cap S_2) = \mu(S_1)\mu(S_2)$;
(v) if $E \perp S$, $F \perp S$ and $E \perp F$, then $E \sim S \Rightarrow E \sim_F S$.

Discussion of Axiom 4. A family of events satisfying conditions (i) and (ii)
is easily identified by imagining an idealised roulette wheel of unit circumference.
We suppose that no point on the circumference is “favoured” as a resting place for
the ball (considered as a point) in the sense that given any $c_1$, $c_2$ and events $S_1$, $S_2$
corresponding to the ball landing within specified connected arcs, or finite unions
and intersections of such arcs, $\{c_1 \mid S_1, c_2 \mid S_1^c\}$ and $\{c_1 \mid S_2, c_2 \mid S_2^c\}$ are considered
equivalent if and only if $\mu(S_1) = \mu(S_2)$, where $\mu$ is the function mapping the
“arc-event” to its total length. Conditions (i) and (ii) are then intuitively obvious, as is the
fact, in (iii), that for any $\alpha \in [0, 1]$ we can construct an $S$ with $\mu(S) = \alpha$. Note that
$\mathcal{S}$ is required to be an algebra and thus both $\emptyset$ and $\Omega$ are standard events. It follows
from Proposition 2.4 and Axiom 4(i) that $\mu(\emptyset) = 0$ and $\mu(\Omega) = 1$. The remainder
of (iii) is intuitively obvious: we note first that the basic idea of an idealised roulette
wheel does assume that each “play” on such a wheel is “independent”, in the sense
of Definition 2.6, of any other events, including previous “plays” on the same wheel.
Thus, for any events $E$, $F$ in $\mathcal{E}$, we can always think of an “independent” play which
generates independent events $S$ in $\mathcal{S}$ with $\mu(S) = \alpha$ for any specified $\alpha$ in $[0, 1]$. In
this extended setting, if we think of the circumferences for two independent plays
as unravelled to form the sides of a unit square, with $\mu$ mapping events to the areas
they define, condition (iv) is clearly satisfied. Finally, (v) encapsulates an obviously
desirable consequence of independence; namely, that if $E$ is independent of $F$ and
$S$, and $F$ is independent of $S$, a judgement of equivalence between $E$ and $S$ should
not be affected by the occurrence of $F$.

We will refer to $\mathcal{S}$ as a standard family of events in $\mathcal{E}$ and will think of $\mathcal{E}$ as
the algebra generated by the relevant events in the decision problem together with
the elements of $\mathcal{S}$. Other forms of standard family satisfying (i) to (v) are easily
imagined. For example, it is obvious that a roulette wheel of unit circumference
could be imagined cut at some point and “unravelled” to form a unit interval. The
underlying image would then be that of a point landing in the unit interval and an
event $S$ such that $\mu(S) = p$ would denote a subinterval of length $p$; alternatively,
we could imagine a point landing in the unit square, with $S$ denoting a region of
area $p$. The obvious intuitive content of conditions (i) to (v) can clearly be similarly
motivated in these cases, the discussion for the unit interval being virtually identical
to that given for the roulette wheel. It is important to emphasise that we do not
require the assumption that standard families of events actually, physically exist,
or could be precisely constructed in accordance with conditions (i) to (v). We only
require that we can invoke such a set up as a mental image.
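
The unit-interval image can be made concrete in a few lines. This is a sketch under stated assumptions rather than part of the formal system: standard events are represented, purely for illustration, as lists of disjoint half-open subintervals of $[0, 1)$, and the names `mu`, `intersect` and `product_area` are hypothetical. It checks the additivity of condition (ii) and, treating two independent plays as the two sides of the unit square, the product form of condition (iv).

```python
from fractions import Fraction as Fr


def mu(event):
    """Total length of an event given as a list of disjoint half-open intervals [a, b)."""
    return sum((b - a for a, b in event), Fr(0))


def intersect(e1, e2):
    """Intersection of two events, each a list of disjoint half-open intervals."""
    pieces = []
    for a1, b1 in e1:
        for a2, b2 in e2:
            lo, hi = max(a1, a2), min(b1, b2)
            if lo < hi:
                pieces.append((lo, hi))
    return pieces


def product_area(e1, e2):
    """Area of the part of the unit square where the first play lands in e1
    and the second, independent, play lands in e2."""
    return sum(((b1 - a1) * (b2 - a2) for a1, b1 in e1 for a2, b2 in e2), Fr(0))


S1 = [(Fr(0), Fr(1, 4))]
S2 = [(Fr(1, 2), Fr(5, 6))]

# Condition (ii): S1 and S2 are disjoint, so lengths add over their union.
assert intersect(S1, S2) == []
assert mu(S1 + S2) == mu(S1) + mu(S2)

# Condition (iv): for two independent plays the joint event is a rectangle
# whose area is the product of the two lengths.
assert product_area(S1, S2) == mu(S1) * mu(S2)

print(mu(S1), mu(S2), mu(S1 + S2), product_area(S1, S2))
```

Concatenating the two interval lists represents the union only because the events are disjoint; overlapping events would first need to be merged.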

There is, of course, an element of mathematical idealisation involved in
thinking about all $p \in [0, 1]$, rather than, for example, some subset of the rationals,
corresponding to binary expansions consisting of zeros from some specified
location onwards, reflecting the inherent limits of accuracy in any actual procedure
for determining arc lengths or areas. The same is true, however, of all scientific
discourse in which measurements are taken, in principle, to be real numbers, rather
than a subset of the rationals chosen to reflect the limits of accuracy in the physical
measurement procedure being employed. Our argument for accepting this degree
of mathematical idealisation in setting up our formal system is the same as would
apply in the physical sciences. Namely, that no serious conceptual distortion is
introduced, while many irrelevant technical difficulties are avoided: in particular,
those concerning the non-closure of a set of numbers with respect to operations of
interest. This argument is not universally accepted, however, and further, related
discussion of the issue is provided in Section 2.8.

Our view is that, from the perspective of the foundations of decision-making, 
the step from the finite to the infinite implicit in making use of real numbers is 
simply a pragmatic convenience, whereas the step from comparing a finite set of 
possibilities to comparing an infinite set has more substantive implications. We 
have emphasised this latter point by postponing infinite extensions of the decision 
framework until Chapter 3. 


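Proposition 2.6. (Collections of disjoint standard events).


For any finite collection $\{\alpha_1, \ldots, \alpha_n\}$ of real numbers such that $\alpha_i > 0$ and
$\alpha_1 + \cdots + \alpha_n \leq 1$ there exists a corresponding collection $\{S_1, \ldots, S_n\}$ of
disjoint standard events such that $\mu(S_i) = \alpha_i$, $i = 1, \ldots, n$.


Proof. By Axiom 4(iii) there exists $S_1$ such that $\mu(S_1) = \alpha_1$. For $1 < j \leq n$,
suppose inductively that $S_1, \ldots, S_{j-1}$ are disjoint, $B_j = S_1 \cup \cdots \cup S_{j-1}$ and define
$\beta_j = \alpha_1 + \cdots + \alpha_{j-1} = \mu(B_j)$. By Axiom 4(iii, iv), there exists $T_j$ in $\mathcal{S}$, with
$\mu(T_j) = \alpha_j/(1 - \beta_j)$ and $T_j \perp B_j$, such that
$\mu(B_j \cap T_j) = \mu(B_j)\{\alpha_j/(1 - \beta_j)\}$. Define $S_j = T_j \cap B_j^c$, so that $S_i \cap S_j = \emptyset$,
$i = 1, \ldots, j - 1$. Then, $T_j = S_j \cup (T_j \cap B_j)$ and hence, using Axiom 4(ii),
$\mu(T_j) = \mu(S_j) + \mu(T_j \cap B_j)$. Thus, $\mu(S_j) = \alpha_j/(1 - \beta_j) - \alpha_j\beta_j/(1 - \beta_j) = \alpha_j$
and the result follows. ◁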
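
The arithmetic of the inductive step can be followed through on a small case. The sketch below does not construct events; it only tracks the measures appearing in the proof, and the function name `disjoint_standard_measures` and the particular values of $\alpha_i$ are hypothetical. At each step it forms $\mu(T_j) = \alpha_j/(1 - \beta_j)$, removes the overlap $\mu(B_j \cap T_j) = \beta_j \mu(T_j)$ supplied by Axiom 4(iv), and records what remains as $\mu(S_j)$.

```python
from fractions import Fraction


def disjoint_standard_measures(alphas):
    """Follow the proof of Proposition 2.6 at the level of measures.

    Returns [mu(S_1), ..., mu(S_n)] as computed by the inductive
    construction, which should reproduce the input alphas exactly.
    """
    alphas = [Fraction(a) for a in alphas]
    if not alphas or any(a <= 0 for a in alphas) or sum(alphas) > 1:
        raise ValueError("need alpha_i > 0 with alpha_1 + ... + alpha_n <= 1")

    measures = [alphas[0]]  # mu(S_1) = alpha_1 by Axiom 4(iii)
    beta = alphas[0]        # beta_j = mu(B_j) = alpha_1 + ... + alpha_{j-1}
    for alpha in alphas[1:]:
        mu_t = alpha / (1 - beta)           # mu(T_j)
        overlap = beta * mu_t               # mu(B_j and T_j), by Axiom 4(iv)
        measures.append(mu_t - overlap)     # mu(S_j), by Axiom 4(ii)
        beta += alpha
    return measures


example = [Fraction(1, 5), Fraction(3, 10), Fraction(1, 2)]
print(disjoint_standard_measures(example))
```

The division by $1 - \beta_j$ is always legitimate here, since $\alpha_j > 0$ and the total does not exceed 1, so that $\beta_j < 1$ at every step.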


Axiom 5. (Precise measurement of preferences and uncertainties).


(i) If $c_1 \leq c \leq c_2$, there exists a standard event $S$ such that
$c \sim \{c_2 \mid S, c_1 \mid S^c\}$.
(ii) For each event $E$, there exists a standard event $S$ such that $E \sim S$.


Discussion of Axiom 5. In the introduction to this section, we discussed the
idea of precision through quantification and pointed out, using analogies with other
measurement systems such as weight, length and temperature, that the process is
based on successive comparisons with a standard. Let $S_x$ denote a standard event
such that $\mu(S_x) = x$. We start with the obvious preferences, $\{c_2 \mid S_0, c_1 \mid S_0^c\} \leq c \leq
\{c_2 \mid S_1, c_1 \mid S_1^c\}$, for any $c_1 \leq c \leq c_2$, and then begin to explore comparisons with
standard options based on $S_x$, $S_y$ with $0 < x < y < 1$. In this way, by gradually
increasing $x$ away from 0 and decreasing $y$ away from 1, we arrive at comparisons
such as $\{c_2 \mid S_x, c_1 \mid S_x^c\} \leq c \leq \{c_2 \mid S_y, c_1 \mid S_y^c\}$, with the difference $y - x$ becoming
increasingly small. Intuitively, as we increase $x$, $\{c_2 \mid S_x, c_1 \mid S_x^c\}$ becomes more
and more “attractive” as an option, and as we decrease $y$, $\{c_2 \mid S_y, c_1 \mid S_y^c\}$ becomes
less “attractive”. Any given consequence $c$, such that $c_1 \leq c \leq c_2$, can therefore be
“sandwiched” arbitrarily tightly and, in the limit, be judged equivalent to one of the
standard options defined in terms of $c_1$, $c_2$. The essence of Axiom 5(i) is that we
can proceed to a common limit, $\alpha$, say, approached from below by the successive
values of $x$ and above by the successive values of $y$. The standard family of options
is thus assumed to provide a continuous scale against which any consequence can
be precisely compared.

Condition (ii) extends the idea of precise comparison to include the assumption
that, for any event $E$ and for all consequences $c_1$, $c_2$ such that $c_1 < c_2$, the option
$\{c_2 \mid E, c_1 \mid E^c\}$ can be compared precisely with the family of standard options
$\{c_2 \mid S_x, c_1 \mid S_x^c\}$, $x \in [0, 1]$, defined by $c_1$ and $c_2$. The underlying idea is similar
to that motivating condition (i). Indeed, given the intuitive content of the relation
“not more likely than”, we can begin with the obvious ordering $\{c_2 \mid S_0, c_1 \mid S_0^c\} \leq
\{c_2 \mid E, c_1 \mid E^c\} \leq \{c_2 \mid S_1, c_1 \mid S_1^c\}$ for any event $E$, and then consider refinements
of this of the form $\{c_2 \mid S_x, c_1 \mid S_x^c\} \leq \{c_2 \mid E, c_1 \mid E^c\} \leq \{c_2 \mid S_y, c_1 \mid S_y^c\}$, with $x$
increasing gradually from 0, $y$ decreasing gradually from 1, and $y - x$ becoming
increasingly small, so that, in terms of the ordering of the events, $S_x \leq E \leq S_y$.
Again, the essence of the axiom is that this “sandwiching” can be refined arbitrarily
closely by an increasing sequence of $x$’s and a decreasing sequence of $y$’s tending
to a common limit.

The preceding argument certainly again involves an element of mathematical
idealisation. In practice, there might, in fact, be some interval of indifference, in
the sense that we judge $\{c_2 \mid S_x, c_1 \mid S_x^c\} \leq c \leq \{c_2 \mid S_y, c_1 \mid S_y^c\}$ for some (possibly
rational) $x$ and $y$ but feel unable to express a more precise form of preference. This
is analogous to the situation where a physical measuring instrument has inherent
limits, enabling one to conclude that a reading is in the range 3.126 to 3.135, say,
but not permitting a more precise statement. In this case, we would typically report
the measurement to be 3.13 and proceed as if this were a precise measurement. We
formulate the theory on the prescriptive assumption that we aspire to exact
measurement (exact comparisons in our case), whilst acknowledging that, in practice,
we have to make do with the best level of precision currently available (or devote
some resources to improving our measuring instruments!).


In the context of measuring beliefs, several authors have suggested that this
imprecision be formally incorporated into the axiom system. For many
applications, this would seem to be an unnecessary confusion of the prescriptive and
the descriptive. Every physicist or chemist knows that there are inherent limits
of accuracy in any given laboratory context but, so far as we know, no one has
suggested developing the structures of theoretical physics or chemistry on the
assumption that quantities appearing in fundamental equations should be
constrained to take values in some subset of the rationals. However, it may well be
that there are situations where imprecision in the context of comparing
consequences is too basic and problematic a feature to be adequately dealt with by an
approach based on theoretical precision, tempered with pragmatically
acknowledged approximation. We shall return to this issue in Section 2.8.
The particular standard option to which $c$ is judged equivalent will, of course,
depend on $c$, but we have implicitly assumed that it does not depend on any
information we might have concerning the occurrence of real-world events. Indeed,
Proposition 2.3 implies that our “attitudes” or “values” regarding consequences are
fixed throughout the analysis of any particular decision problem. It is intuitively
obvious that, if the time-scale on which values change were not rather long compared
with the time-scale within which individual problems are analysed, there would be
little hope for rational analysis of any kind.


2.4 BELIEFS AND PROBABILITIES 
2.4.1 Representation of Beliefs


It is clear that an individual’s preferences among options in any decision problem 
should depend, at least in part, on the “degrees of belief” which that individual 
attaches to the uncertain events forming part of the definitions of the options. 

The principles of coherence and quantification by comparison with a standard, 
expressed in axiomatic form in the previous section, will enable us to give a formal 
definition of degree of belief, thus providing a numerical measure of the uncertainty 
attached to each event. 


The conceptual basis for this numerical measure will be seen to derive from 
the formal rules governing quantitative, coherent preferences, irrespective of the 
nature of the uncertain events under consideration. This is in vivid contrast to what 
are sometimes called the classical and frequency approaches to defining numerical 
measures of uncertainty (see Section 2.8), where the existence of symmetries and 
the possibility of indefinite replication, respectively, play fundamental roles in 
defining the concepts for restricted classes of events. 


We cannot emphasise strongly enough the important distinction between defin- 
ing a general concept and evaluating a particular case. Our definition will depend 
only on the logical notions of quantitative, coherent preferences; our practical eval- 
uations will often make use of perceived symmetries and observed frequencies. 

We begin by establishing some basic results concerning the uncertainty relation 
between events. 


Proposition 2.7. (Complete comparability of events).
Either $E_1 > E_2$, or $E_1 \sim E_2$, or $E_2 > E_1$.

Proof. By Axiom 5(ii), there exist $S_1$ and $S_2$ such that $E_1 \sim S_1$ and
$E_2 \sim S_2$; the complete ordering now follows from Axiom 4(i) and
Proposition 2.1. $\triangleleft$

We see from Proposition 2.7 that, although the order relation $\le$ between options
was not assumed to be complete (i.e., not all pairs of options were assumed to be
comparable), it turns out, as a consequence of Axiom 5 (the axiom of precise
measurement), that the uncertainty relation induced between events is complete.
A similar result concerning the comparability of all options will be established in
Section 2.5.


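Proposition 2.8. (Additivity of uncertainty relations). If $A \le B$, $C \le D$
and $A \cap C = B \cap D = \emptyset$, then $A \cup C \le B \cup D$. Moreover, if
$A < B$ or $C < D$, then $A \cup C < B \cup D$.

Proof. We first show that, for any $G$, if $A \cap G = B \cap G = \emptyset$ then
$A \le B \iff A \cup G \le B \cup G$. For any $c_2 > c_1$,
$A \cap G = B \cap G = \emptyset$, define:

$$a_1 = \{c_2 \mid A, c_1 \mid A^c\} = \{c_1 \mid G, \{c_2 \mid A, c_1 \mid A^c\} \mid G^c\}$$
$$a_2 = \{c_2 \mid B, c_1 \mid B^c\} = \{c_1 \mid G, \{c_2 \mid B, c_1 \mid B^c\} \mid G^c\}$$
$$a_3 = \{c_2 \mid A \cup G, c_1 \mid (A \cup G)^c\} = \{c_2 \mid G, \{c_2 \mid A, c_1 \mid A^c\} \mid G^c\}$$
$$a_4 = \{c_2 \mid B \cup G, c_1 \mid (B \cup G)^c\} = \{c_2 \mid G, \{c_2 \mid B, c_1 \mid B^c\} \mid G^c\}.$$

Then, by Definition 2.3, $A \le B \iff a_1 \le a_2$; by Axiom 3,
$a_1 \le a_2 \iff a_3 \le a_4$; and using again Definition 2.3,
$a_3 \le a_4 \iff A \cup G \le B \cup G$. Thus,

$$A \cup (C - B) \le B \cup (C - B) = B \cup C = C \cup (B - C) \le D \cup (B - C),$$
$$A \cup C = A \cup (C - B) \cup (C \cap B) \le D \cup (B - C) \cup (C \cap B) = B \cup D.$$

The final statement follows from essentially the same argument. $\triangleleft$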


We now make the key definition which enables us to move to a quantitative 
notion of degree of belief. 


Definition 2.7. (Measure of degree of belief). Given an uncertainty relation
$\le$, the probability $P(E)$ of an event $E$ is the real number $\mu(S)$ associated
with any standard event $S$ such that $E \sim S$.

This definition provides a natural, operational extension of the qualitative
uncertainty relation encapsulated in Definition 2.3, by linking the equivalence of
any $E \in \mathcal{E}$ to some $S \in \mathcal{S}$ and exploiting the fact that the
nature of the construction of $\mathcal{S}$ provides a direct, obvious quantification
of the uncertainty regarding $S$.

With our operational definition, the meaning of a probability statement is clear.
For instance, the statement $P(E) = 0.5$ precisely means that $E$ is judged to be
equally likely as a standard event of ‘measure’ 0.5, maybe a conceptual perfect coin
falling heads, or a computer generated ‘random’ integer being an odd number.

It should be emphasised that, according to Definition 2.7, probabilities are
always personal degrees of belief, in that they are a numerical representation of
the decision-maker’s personal uncertainty relation $\le$ between events. Moreover,
probabilities are always conditional on the information currently available. It makes
no sense, within the framework we are discussing, to qualify the word probability
with adjectives such as “objective”, “correct” or “unconditional”.

Since probabilities are obviously conditional on the initial state of information
$M_0$, a more precise and revealing notation in Definition 2.7 would have been
$P(E \mid M_0)$. In order to avoid cumbersome notation, we shall stick to the shorter
version, but the implicit conditioning on $M_0$ should always be borne in mind.


Proposition 2.9. (Existence and uniqueness). Given an uncertainty relation
$\le$, there exists a unique probability $P(E)$ associated with each event $E$.

Proof. Existence follows from Axiom 5(ii). For uniqueness, if $E \sim S_1$ and
$E \sim S_2$ then, by Proposition 2.2(ii), $S_1 \sim S_2$. The result now follows from
Axiom 4(i). $\triangleleft$


Definition 2.8. (Compatibility). A function $f : \mathcal{E} \to \Re$ is said to be
compatible with an order relation $\le$ on $\mathcal{E} \times \mathcal{E}$ if, for
all events,

$$E \le F \iff f(E) \le f(F).$$

Proposition 2.10. (Compatibility of probability and degrees of belief).
The probability function $P(.)$ is compatible with the uncertainty relation $\le$.

Proof. By Axiom 5(ii) there exist standard events $S_1$ and $S_2$ such that
$E \sim S_1$ and $F \sim S_2$. Then, by Proposition 2.2(ii), $E \le F$ iff
$S_1 \le S_2$ and hence, by Axiom 4(i), iff $\mu(S_1) \le \mu(S_2)$. The result
follows from Definition 2.7. $\triangleleft$


The following proposition is of fundamental importance. It establishes that
coherent, quantitative degrees of belief have the structure of a finitely additive
probability measure over $\mathcal{E}$. Moreover, it establishes that significant
events, i.e., events which are “practically possible but not certain”, should be
assigned probability values in the open interval (0, 1).


Proposition 2.11. (Probability structure of degrees of belief).
(i) $P(\emptyset) = 0$ and $P(\Omega) = 1$.

(ii) If $E \cap F = \emptyset$, then $P(E \cup F) = P(E) + P(F)$.

(iii) $E$ is significant if, and only if, $0 < P(E) < 1$.

Proof. (i) By Definition 2.7, $0 \le P(E) \le 1$. Moreover, by Axiom 4(iii) there
exist $S_*$ and $S^*$ such that $\mu(S_*) = 0$ and $\mu(S^*) = 1$. By Proposition 2.4,
$\emptyset \le S_*$ and, by Proposition 2.10, $P(\emptyset) \le 0$; hence,
$P(\emptyset) = 0$; similarly, $S^* \le \Omega$ implies that $P(\Omega) = 1$.

(ii) If $E = \emptyset$ or $F = \emptyset$, or both, the result is trivially true. If
$E > \emptyset$ and $F > \emptyset$, then, by Proposition 2.8, $E \cup F > E$; thus, if
$\alpha = P(E)$ and $\beta = P(E \cup F)$, we have $\alpha < \beta$ and, by
Proposition 2.6, there exist events $S_1$, $S_2$ such that $S_1 \cap S_2 = \emptyset$,
$P(S_1) = \alpha$ and $P(S_2) = \beta - \alpha$. By Proposition 2.7, $F > S_2$ or
$F \sim S_2$ or $F < S_2$. If $F > S_2$, then, by Proposition 2.8,
$E \cup F > S_1 \cup S_2$ and hence $P(E \cup F) > \beta$, which is impossible;
similarly, if $F < S_2$ then $E \cup F < S_1 \cup S_2$ and $P(E \cup F) < \beta$ which,
again, is impossible. Hence, $F \sim S_2$ and therefore $P(F) = \beta - \alpha$, so
that $P(E \cup F) = P(E) + P(F)$, as stated.

(iii) By Proposition 2.5, $E$ is significant iff $\emptyset < E < \Omega$. The result
then follows immediately from Proposition 2.10. $\triangleleft$


Corollary. (Finitely additive structure of degrees of belief).
(i) If $\{E_j, j \in J\}$ is a finite collection of disjoint events, then

$$P\Big(\bigcup_{j \in J} E_j\Big) = \sum_{j \in J} P(E_j).$$

(ii) For any event $E$, $P(E^c) = 1 - P(E)$.

Proof. The first part follows by induction from Proposition 2.11(ii); the second
part is a special case of (i) since if $\bigcup_j E_j = \Omega$ then, by
Proposition 2.11(i), $\sum_j P(E_j) = 1$. $\triangleleft$


Proposition 2.11 is crucial. It establishes formally that coherent, quantitative
measures of uncertainty about events must take the form of probabilities, therefore
justifying the nomenclature adopted in Definition 2.7 for this measure of degree of
belief. In short, coherent degrees of belief are probabilities.

It will often be convenient for us to use probability terminology, without
explicit reference to the fact that the mathematical structure is merely serving as a
representation of (personal) degrees of belief. The latter fact should, however, be
constantly borne in mind.


Definition 2.9. (Probability distribution). If $\{E_j, j \in J\}$ form a finite
partition of $\Omega$, with $P(E_j) = p_j$, $j \in J$, then $\{p_j, j \in J\}$ is said
to be a probability distribution over the partition.

This terminology will prove useful in later discussions. The idea is that total
belief (in $\Omega$, having measure 1) is distributed among the events of the
partition, $\{E_j, j \in J\}$, according to the relative degrees of belief
$\{p_j, j \in J\}$, with $\sum_j p_j = \sum_j P(E_j) = 1$.

Starting from the qualitative ordering among events, we have derived a quantitative
measure, $P(.) = P(. \mid M_0)$, over $\mathcal{E}$ and shown that, expressed in
conventional mathematical terminology, it has the form of a finitely additive
probability measure, compatible with the qualitative ordering $\le$. We now establish
that this is the only probability measure over $\mathcal{E}$ compatible with $\le$.


Proposition 2.12. (Uniqueness of the probability measure). $P$ is the only
probability measure compatible with the uncertainty relation $\le$.

Proof. If $P'$ were another compatible measure, then by Proposition 2.8 we
would always have $P'(E) \le P'(F) \iff P(E) \le P(F)$; hence, there exists
a monotonic function $f$ of $[0, 1]$ into itself such that $P'(E) = f\{P(E)\}$. By
Proposition 2.6, for all non-negative $\alpha$, $\beta$ such that
$\alpha + \beta \le 1$, there exist disjoint standard events $S_1$ and $S_2$, such
that $P(S_1) = \alpha$ and $P(S_2) = \beta$. Hence, by Axiom 4(ii),
$f(\alpha + \beta) = P'(S_1 \cup S_2) = P'(S_1) + P'(S_2) = f(\alpha) + f(\beta)$
and so (Eichhorn, 1978, Theorem 2.63), $f(\alpha) = k\alpha$ for all $\alpha$ in
$[0, 1]$. But, by Proposition 2.9, $P'(\Omega) = 1$ and hence, $k = 1$, so that we
have $P'(E) = P(E)$ for all $E$. $\triangleleft$


We shall now establish that our operational definition of (pairwise) indepen- 
dence of events is compatible with its more standard, ad hoc, product definition. 


Proposition 2.13. (Characterisation of independence).
$$E \perp F \iff P(E \cap F) = P(E)P(F).$$

Proof. Suppose $E \perp F$. By Axiom 4(iii), there exists $S_1$ such that
$P(S_1) = P(E)$, $E \perp S_1$ and $F \perp S_1$. Hence, by Axiom 4(v),
$E \sim_F S_1$, so that, for any consequences $c_1 < c_2$, and any option $a$,

$$\{c_2 \mid E \cap F, c_1 \mid E^c \cap F, a \mid F^c\} \sim \{c_2 \mid S_1 \cap F, c_1 \mid S_1^c \cap F, a \mid F^c\}.$$

Taking $a = c_1$, we have

$$\{c_2 \mid E \cap F, c_1 \mid (E \cap F)^c\} \sim \{c_2 \mid S_1 \cap F, c_1 \mid (S_1 \cap F)^c\},$$

so that $E \cap F \sim S_1 \cap F$. Again by Axiom 4(iii), given $F$, $S_1$, there
exists $S_2$ such that $P(S_2) = P(F)$, $F \perp S_2$ and $S_1 \perp S_2$. Hence, by
an identical argument to the above, and noting from Definition 2.6 the symmetry of
$\perp$, we have

$$S_1 \cap F \sim S_1 \cap S_2.$$

By Propositions 2.1, 2.10, and Axiom 4(iv),

$$P(E \cap F) = P(S_1 \cap S_2) = P(S_1)P(S_2),$$

and hence $P(E \cap F) = P(E)P(F)$.

Suppose $P(E \cap F) = P(E)P(F)$. By Axiom 4(iii), there exists $S$ such that
$P(S) = P(F)$ and $F \perp S$, $E \perp S$. Hence, by the first part of the proof,

$$P(E \cap S) = P(E)P(S) = P(E)P(F) = P(E \cap F),$$

so that $E \cap F \sim E \cap S$. Now suppose, without loss of generality, that
$c \le \{c_2 \mid E, c_1 \mid E^c\}$. Then, by Definition 2.6,

$$\{c \mid S, c_1 \mid S^c\} \le \{c_2 \mid E \cap S, c_1 \mid (E \cap S)^c\}.$$

But $\{c \mid S, c_1 \mid S^c\} \sim \{c \mid F, c_1 \mid F^c\}$ and

$$\{c_2 \mid E \cap S, c_1 \mid (E \cap S)^c\} \sim \{c_2 \mid E \cap F, c_1 \mid (E \cap F)^c\};$$

hence, by Proposition 2.2,

$$\{c \mid F, c_1 \mid F^c\} \le \{c_2 \mid E \cap F, c_1 \mid (E \cap F)^c\},$$

so that $c \le_F \{c_2 \mid E, c_1 \mid E^c\}$. A similar argument can obviously be
given reversing the roles of $E$ and $F$, hence establishing that
$E \perp F$. $\triangleleft$


2.4.2 Revision of Beliefs and Bayes’ Theorem 


The assumed occurrence of a real-world event will typically modify preferences
between options by modifying the degrees of belief attached, by an individual, to the
events defining the options. In this section, we use the assumptions of Section 2.3
in order to identify the precise way in which coherent modification of initial beliefs
should proceed.

The starting point for analysing order relations between events, given the assumed
occurrence of a possible event $G$, is the uncertainty relation $\le_G$ defined
between events. Given the assumed occurrence of $G > \emptyset$, the ordering $\le$
between acts is replaced by $\le_G$. Analogues of Propositions 2.1 and 2.2 are
trivially established and we recall (Proposition 2.3) that, for any $G > \emptyset$,
$c_2 \le c_1$ iff $c_2 \le_G c_1$.


Proposition 2.14. (Properties of conditional beliefs).
(i) $E \le_G F \iff E \cap G \le F \cap G$.

(ii) If there exist $c_1 < c_2$ such that
$\{c_2 \mid E, c_1 \mid E^c\} \le_G \{c_2 \mid F, c_1 \mid F^c\}$, then $E \le_G F$.

Proof. By Definition 2.4 and Proposition 2.3, $E \le_G F$ iff, for all $c_2 > c_1$,
$$\{c_2 \mid E, c_1 \mid E^c\} \le_G \{c_2 \mid F, c_1 \mid F^c\},$$
i.e., if, and only if, for all $a$,
$$\{c_2 \mid E \cap G, c_1 \mid E^c \cap G, a \mid G^c\} \le \{c_2 \mid F \cap G, c_1 \mid F^c \cap G, a \mid G^c\}.$$
Taking $a = c_1$,
$$E \le_G F \iff \{c_2 \mid E \cap G, c_1 \mid (E \cap G)^c\} \le \{c_2 \mid F \cap G, c_1 \mid (F \cap G)^c\},$$
and this is true iff $E \cap G \le F \cap G$.

Moreover, if there exist $c_2 > c_1$ such that
$\{c_2 \mid E, c_1 \mid E^c\} \le_G \{c_2 \mid F, c_1 \mid F^c\}$ then, by
Definition 2.4, with $a = c_1$,
$$\{c_2 \mid E \cap G, c_1 \mid E^c \cap G, c_1 \mid G^c\} \le \{c_2 \mid F \cap G, c_1 \mid F^c \cap G, c_1 \mid G^c\},$$
so that
$$\{c_2 \mid E \cap G, c_1 \mid (E \cap G)^c\} \le \{c_2 \mid F \cap G, c_1 \mid (F \cap G)^c\}$$
and the result follows from Axiom 3(ii) and part (i) of this
proposition. $\triangleleft$


Definition 2.10. (Conditional measure of degree of belief). Given a conditional
uncertainty relation $\le_G$, $G > \emptyset$, the conditional probability
$P(E \mid G)$ of an event $E$ given the assumed occurrence of $G$ is the real number
$\mu(S)$ such that $E \sim_G S$, where $S$ is a standard event independent of $G$.


Generalising the idea encapsulated in Definition 2.7, P(E|G) provides a 
quantitative operational measure of the uncertainty attached to E given the assumed 
occurrence of the event G. The following fundamental result provides the key to 
the process of revising beliefs in a coherent manner in the light of new information. 
It relates the conditional measure of degree of belief P(. | G) to the initial measure 
of degree of belief P(.). 


We have, of course, already stressed that all degrees of belief are conditional.
The intention of the terminology used above is to emphasise the additional
conditioning resulting from the occurrence of $G$; the initial state of information,
$M_0$, is always present as a conditioning factor, although omitted throughout for
notational convenience.


Proposition 2.15. (Conditional probability). For any $G > \emptyset$,

$$P(E \mid G) = \frac{P(E \cap G)}{P(G)}.$$

Proof. By Axiom 4(iii) and Proposition 2.13, there exists $S \perp G$ such that
$\mu(S) = P(E \cap G)/P(G)$. By Proposition 2.13,
$$P(S \cap G) = P(S)P(G) = \mu(S)P(G) = P(E \cap G).$$
Thus, by Proposition 2.10, $S \cap G \sim E \cap G$ and, by Proposition 2.14,
$S \sim_G E$. Thus, by Definition 2.10,
$P(E \mid G) = \mu(S) = P(E \cap G)/P(G)$. $\triangleleft$

Note that, in our formulation, $P(E \mid G) = P(E \cap G)/P(G)$ is a logical
derivation from the axioms, not an ad hoc definition. In fact, this is the simplest
version of Bayes’ theorem. An extended form is given later in Proposition 2.19.


Proposition 2.16. (Compatibility of conditional probability and conditional
degrees of belief).
$$E \le_G F \iff P(E \mid G) \le P(F \mid G).$$

Proof. By Proposition 2.14(i), $E \le_G F$ iff $E \cap G \le F \cap G$, which, by
Proposition 2.10, holds if and only if $P(E \cap G) \le P(F \cap G)$; the result now
follows from Proposition 2.15.


We now extend Proposition 2.11 to degrees of belief conditional on the occur- 
rence of significant events. 


Proposition 2.17. (Probability structure of conditional degrees of belief).
For any event $G > \emptyset$,
(i) $P(\emptyset \mid G) = 0 \le P(E \mid G) \le P(\Omega \mid G) = 1$;
(ii) if $E \cap F \cap G = \emptyset$, then $P(E \cup F \mid G) = P(E \mid G) + P(F \mid G)$;
(iii) $E$ is significant given $G$ $\iff$ $0 < P(E \mid G) < 1$.

Proof. By Proposition 2.15, $P(E \mid G) \ge 0$ and $P(\emptyset \mid G) = 0$;
moreover, since $E \cap G \le G$, Proposition 2.10 implies that
$P(E \cap G) \le P(G)$, so that, by Proposition 2.15, $P(E \mid G) \le 1$. Finally,
$\Omega \cap G = G$, so that, using again Proposition 2.15, $P(\Omega \mid G) = 1$.

By Proposition 2.15,

$$P(E \cup F \mid G) = \frac{P((E \cap G) \cup (F \cap G))}{P(G)} = \frac{P(E \cap G)}{P(G)} + \frac{P(F \cap G)}{P(G)} = P(E \mid G) + P(F \mid G).$$

Finally, by Proposition 2.5, $E$ is significant given $G$ iff
$\emptyset < E \cap G < G$. Thus, by Proposition 2.10, $E$ is significant given $G$
iff $0 < P(E \cap G) < P(G)$. The result follows from
Proposition 2.15. $\triangleleft$


Corollary. (Finitely additive structure of conditional degrees of belief).
For all $G > \emptyset$,

(i) if $\{E_j \cap G, j \in J\}$ is a finite collection of disjoint events, then

$$P\Big(\bigcup_{j \in J} E_j \,\Big|\, G\Big) = \sum_{j \in J} P(E_j \mid G);$$

(ii) for any event $E$, $P(E^c \mid G) = 1 - P(E \mid G)$.

Proof. This parallels the proof of the Corollary to Proposition 2.11.

Proposition 2.18. (Uniqueness of the conditional probability measure).
$P(. \mid G)$ is the only probability measure compatible with the conditional
uncertainty relation $\le_G$.

Proof. This parallels the proof of Proposition 2.12. $\triangleleft$


Example 2.1. (Simpson’s paradox). The following example provides an instructive 
illustration of the way in which the formalism of conditional probabilities provides a coherent 
resolution of an otherwise seemingly paradoxical situation. 

Suppose that the results of a clinical trial involving 800 sick patients are as shown in
Table 2.1, where $T$, $T^c$ denote, respectively, that patients did or did not receive a
certain treatment, and $R$, $R^c$ denote, respectively, that the patients did or did not
recover.


Table 2.1 Trial results for all patients

          $R$    $R^c$   Total   Recovery rate
$T$       200    200     400     50%
$T^c$     160    240     400     40%

Intuitively, it seems clear that the treatment is beneficial, and were one to base proba- 
bility judgements on these reported figures, it would seem reasonable to specify 


$$P(R \mid T) = 0.5, \quad P(R \mid T^c) = 0.4,$$


where recovery and the receipt of treatment by individuals are now represented, in an obvious 
notation, as events. Suppose now, however, that one became aware of the trial outcomes for 
male and female patients separately, and that these have the summary forms described in 
Tables 2.2 and 2.3. 


Table 2.2 Trial results for male patients

          $R$    $R^c$   Total   Recovery rate
$T$       180    120     300     60%
$T^c$     70     30      100     70%

The results surely seem paradoxical. Tables 2.2 and 2.3 tell us that the treatment is
neither beneficial for males nor for females; but Table 2.1 tells us that overall it is
beneficial! How are we to come to a coherent view in the light of this apparently
conflicting evidence?

Table 2.3 Trial results for female patients

          $R$    $R^c$   Total   Recovery rate
$T$       20     80      100     20%
$T^c$     90     210     300     30%


The seeming paradox is easily resolved by an appeal to the logic of probability which,
after all, we have just demonstrated to be the prerequisite for the coherent treatment of
uncertainty. With $M$, $M^c$ denoting, respectively, the events that a patient is either
male or female, were one to base probability judgements on the figures reported in
Tables 2.2 and 2.3, it would seem reasonable to specify

$$P(R \mid M \cap T) = 0.6, \quad P(R \mid M \cap T^c) = 0.7,$$
$$P(R \mid M^c \cap T) = 0.2, \quad P(R \mid M^c \cap T^c) = 0.3.$$

To see that these judgements do indeed cohere with those based on Table 2.1, we note,
from the Corollary to Proposition 2.11, Proposition 2.15 and the Corollary to
Proposition 2.17, that

$$P(R \mid T) = P(R \mid M \cap T)P(M \mid T) + P(R \mid M^c \cap T)P(M^c \mid T),$$
$$P(R \mid T^c) = P(R \mid M \cap T^c)P(M \mid T^c) + P(R \mid M^c \cap T^c)P(M^c \mid T^c),$$

where

$$P(M \mid T) = 0.75, \quad P(M \mid T^c) = 0.25.$$

The probability formalism reveals that the seeming paradox has arisen from the
confounding of sex with treatment as a consequence of the unbalanced trial design. See
Simpson (1951), Blyth (1972, 1973) and Lindley and Novick (1981) for further discussion.
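
The arithmetic behind this resolution can be checked directly from the counts in Tables 2.2 and 2.3. The following is a minimal sketch, assuming that probabilities are simply read off as relative frequencies, as the example itself suggests; the names `counts`, `recovery_rate`, `sex_weight` and `overall_rate` are ours, introduced only for illustration.

```python
from fractions import Fraction

# (recovered, not recovered) counts from Tables 2.2 and 2.3,
# keyed by (sex, treatment); "Tc" stands for "no treatment".
counts = {
    ("M", "T"): (180, 120),
    ("M", "Tc"): (70, 30),
    ("F", "T"): (20, 80),
    ("F", "Tc"): (90, 210),
}


def recovery_rate(sex, treatment):
    """P(R | sex, treatment), read off as a relative frequency."""
    recovered, not_recovered = counts[(sex, treatment)]
    return Fraction(recovered, recovered + not_recovered)


def sex_weight(sex, treatment):
    """P(sex | treatment), again read off as a relative frequency."""
    group = sum(counts[(sex, treatment)])
    total = sum(sum(counts[(s, treatment)]) for s in ("M", "F"))
    return Fraction(group, total)


def overall_rate(treatment):
    """P(R | treatment) as the weighted sum over sex of the conditional rates."""
    return sum(
        recovery_rate(s, treatment) * sex_weight(s, treatment) for s in ("M", "F")
    )


for t in ("T", "Tc"):
    print(t, float(overall_rate(t)))
```

Within each sex the untreated recovery rate is the higher one, yet the weights $P(M \mid T)$ and $P(M \mid T^c)$ differ so much between the two arms that the weighted sums reproduce the overall rates of Table 2.1.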


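Proposition 2.19. (Bayes’ theorem).
For any finite partition $\{E_j, j \in J\}$ of $\Omega$ and $G > \emptyset$,

$$P(E_i \mid G) = \frac{P(G \mid E_i)P(E_i)}{\sum_{j \in J} P(G \mid E_j)P(E_j)}.$$

Proof. By Proposition 2.15,

$$P(E_i \mid G) = \frac{P(E_i \cap G)}{P(G)} = \frac{P(G \mid E_i)P(E_i)}{P(G)}.$$

The result now follows from the Corollary to Proposition 2.11 when applied to
$G = \bigcup_j (G \cap E_j)$. $\triangleleft$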




Bayes’ theorem is a simple mathematical consequence of the fact that quanti- 
tative coherence implies that degrees of belief should obey the rules of probability. 
From another point of view, it may also be established (Zellner, 1988b) that, under 
some reasonable desiderata, Bayes’ theorem is an optimal information processing 
system. 

Since the $\{E_j, j \in J\}$ form a partition and hence, by the Corollary to
Proposition 2.17, $\sum_j P(E_j \mid G) = 1$, Bayes’ theorem may be written in the form

$$P(E_j \mid G) \propto P(G \mid E_j)P(E_j), \quad j \in J,$$

since the missing proportionality constant is
$[P(G)]^{-1} = [\sum_j P(G \mid E_j)P(E_j)]^{-1}$,
and thus it is always possible to normalise the products by dividing by their sum.
This form of the theorem is often very useful in applications.
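
The normalisation step just described is easily mechanised. The sketch below is illustrative only: the function name `posterior` and the numerical values of the prior probabilities and likelihoods are hypothetical, chosen by us and not taken from the text.

```python
def posterior(priors, likelihoods):
    """Return P(E_j | G) for each j, given P(E_j) and P(G | E_j).

    The products P(G | E_j) P(E_j) are formed and then divided by their
    sum, which is P(G).
    """
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("probabilities over a partition must sum to 1")
    products = [p * l for p, l in zip(priors, likelihoods)]
    predictive = sum(products)  # P(G) = sum_j P(G | E_j) P(E_j)
    if predictive == 0:
        raise ValueError("P(G) is zero, so conditioning on G is undefined")
    return [x / predictive for x in products]


# Hypothetical three-event partition.
priors = [0.5, 0.3, 0.2]       # P(E_j)
likelihoods = [0.1, 0.4, 0.8]  # P(G | E_j)

post = posterior(priors, likelihoods)
print(post)
```

Because only the products matter up to proportionality, multiplying every likelihood by the same positive constant leaves the result unchanged; this is why the proportional form is so convenient in applications.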

Bayes’ theorem acquires a particular significance in the case where the uncertain
events $\{E_j, j \in J\}$ correspond to an exclusive and exhaustive set of hypotheses
about some aspect of the world (for example, in a medical context, the set of possible
diseases from which a patient may be suffering) and the event $G$ corresponds to a
relevant piece of evidence, or data (for example, the outcome of a clinical test). If we
adopt the more suggestive notation, $E_j = H_j$, $j \in J$, $G = D$, and, as usual, we
omit explicit notational reference to the initial state of information $M_0$,
Proposition 2.19 leads to Bayes’ theorem in the form
$P(H_j \mid D) = P(D \mid H_j)P(H_j)/P(D)$, $j \in J$, where
$P(D) = \sum_j P(D \mid H_j)P(H_j)$, characterizing the way in which initial beliefs
about the hypotheses, $P(H_j)$, $j \in J$, are modified by the data, $D$, into a
revised set of beliefs, $P(H_j \mid D)$, $j \in J$. This process is seen to depend
crucially on the specification of the quantities $P(D \mid H_j)$, $j \in J$, which
reflect how beliefs about obtaining the given data, $D$, vary over the different
underlying hypotheses, thus defining the “relative likelihoods” of the latter. The
four elements, $P(H_j)$, $P(D \mid H_j)$, $P(H_j \mid D)$ and $P(D)$, occur, in various
guises, throughout Bayesian statistics and it is convenient to have a standard
terminology available.


Definition 2.11. (Prior, posterior, and predictive probabilities).
If $\{H_j, j \in J\}$ are exclusive and exhaustive events (hypotheses), then for any
event (data) $D$,

(i) $P(H_j)$, $j \in J$, are called the prior probabilities of the $H_j$, $j \in J$;

(ii) $P(D \mid H_j)$, $j \in J$, are called the likelihoods of the $H_j$, $j \in J$, given $D$;

(iii) $P(H_j \mid D)$, $j \in J$, are called the posterior probabilities of the $H_j$, $j \in J$;

(iv) $P(D)$ is called the predictive probability of $D$ implied by the likelihoods
and the prior probabilities.


It is important to realise that the terms “prior” and “posterior” only have
significance given an initial state of information and relative to an additional piece
of information. Thus, $P(H_j)$, which could more properly be written as
$P(H_j \mid M_0)$, represents beliefs prior to conditioning on data $D$, but posterior
to conditioning on whatever history led to the state of information described by
$M_0$. Similarly, $P(H_j \mid D)$, or, more properly, $P(H_j \mid M_0 \cap D)$,
represents beliefs posterior to conditioning on $M_0$ and $D$, but prior to
conditioning on any further data which may be obtained subsequent to $D$.

The predictive probability $P(D)$, logically implied by the likelihoods and the
prior probabilities, provides a basis for assessing the compatibility of the data $D$
with our beliefs (see Box, 1980). We shall consider this in more detail in Chapter 6.


Example 2.2. (Medical diagnosis). In simple problems of medical diagnosis, Bayes’
theorem often provides a particularly illuminating form of analysis of the various
uncertainties involved. For simplicity, let us consider the situation where a patient may
be characterised as belonging either to state $H_1$, or to state $H_2$, representing the
presence or absence, respectively, of a specified disease. Let us further suppose that
$P(H_1)$ represents the prevalence rate of the disease in the population to which the
patient is assumed to belong, and that further information is available in the form of
the result of a single clinical test, whose outcome is either positive (suggesting the
presence of the disease and denoted by $D = T$), or negative (suggesting the absence of
the disease and denoted by $D = T^c$).


Figure 2.2. $P(H_1 \mid T)$ and $P(H_1 \mid T^c)$ as functions of $P(H_1)$ (horizontal axis: $P(H_1)$, from 0 to 1)


The quantities $P(T \mid H_1)$ and $P(T^c \mid H_2)$ represent the true positive and
true negative rates of the clinical test (often referred to as the test sensitivity and
test specificity, respectively) and the systematic use of Bayes’ theorem then enables us
to understand the manner in which these characteristics of the test combine with the
prevalence rate to produce varying degrees of diagnostic discriminatory power. In
particular, for a given clinical test of known sensitivity and specificity, we can
investigate the range of underlying prevalence rates for which the test has worthwhile
diagnostic value.

As an illustration of this process, let us consider the assessment of the diagnostic value
of stress thallium-201 scintigraphy, a technique involving analysis of Gamma camera image
data as an indicator of coronary heart disease. On the basis of a controlled experimental
study, Murray et al. (1981) concluded that $P(T \mid H_1) = 0.900$,
$P(T^c \mid H_2) = 0.875$ were reasonable orders of magnitude for the sensitivity and
specificity of the test.

Insight into the diagnostic value of the test can be obtained by plotting values of
$P(H_1 \mid T)$, $P(H_1 \mid T^c)$ against $P(H_1)$, where

$$P(H_1 \mid D) = \frac{P(D \mid H_1)P(H_1)}{P(D \mid H_1)P(H_1) + P(D \mid H_2)P(H_2)}$$

for $D = T$ or $D = T^c$, as shown in Figure 2.2.

As a single, overall measure of the discriminatory power of the test, one may consider
the difference $P(H_1 \mid T) - P(H_1 \mid T^c)$. In cases where $P(H_1)$ has very low
or very high values (e.g. for large population screening or following individual patient
referral on the basis of suspected coronary disease, respectively), there is limited
diagnostic value in the test. However, in clinical situations where there is considerable
uncertainty about the presence of coronary heart disease, for example,
$0.25 \le P(H_1) \le 0.75$, the test may be expected to provide valuable diagnostic
information.
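
The dependence of diagnostic value on prevalence can be tabulated numerically. The following sketch uses the sensitivity and specificity quoted above; the grid of prevalence values and the function names are our own illustrative choices.

```python
SENSITIVITY = 0.900  # P(T | H1), as quoted in the example
SPECIFICITY = 0.875  # P(Tc | H2), as quoted in the example


def post_given_positive(prevalence):
    """P(H1 | T) by Bayes' theorem, with P(T | H2) = 1 - specificity."""
    num = SENSITIVITY * prevalence
    return num / (num + (1 - SPECIFICITY) * (1 - prevalence))


def post_given_negative(prevalence):
    """P(H1 | Tc) by Bayes' theorem, with P(Tc | H1) = 1 - sensitivity."""
    num = (1 - SENSITIVITY) * prevalence
    return num / (num + SPECIFICITY * (1 - prevalence))


def discrimination(prevalence):
    """The difference P(H1 | T) - P(H1 | Tc) used in the text."""
    return post_given_positive(prevalence) - post_given_negative(prevalence)


# Illustrative grid of prevalence rates P(H1) (an assumption, not from the text).
for prev in (0.01, 0.25, 0.50, 0.75, 0.99):
    print(
        f"P(H1)={prev:.2f}  P(H1|T)={post_given_positive(prev):.3f}  "
        f"P(H1|Tc)={post_given_negative(prev):.3f}  "
        f"difference={discrimination(prev):.3f}"
    )
```

On such a grid the difference is small at the extreme prevalences and much larger in the middle of the range, in line with the qualitative conclusion drawn above.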


One further point about the terms prior and posterior is worth emphasising.
They are not necessarily to be interpreted in a chronological sense, with the
assumption that “prior” beliefs are specified first and then later modified into
“posterior” beliefs. Propositions 2.15 and 2.19 do not involve any such chronological
notions. They merely indicate that, for coherence, specifications of degrees of belief
must satisfy the given relationships. Thus, for example, in Proposition 2.15 one might
first specify $P(G)$ and $P(E \mid G)$ and then use the relationship stated in the
theorem to arrive at a coherent specification of $P(E \cap G)$. In any given situation,
the particular order in which we specify degrees of belief and check their coherence is
a pragmatic one; thus, some assessments seem straightforward and we feel comfortable in
making them directly, while we are less sure about other assessments and need to
approach them indirectly via the relationships implied by coherence. It is true that
the natural order of assessment does coincide with the “chronological” order in a
number of practical applications, but it is important to realise that this is a
pragmatic issue and not a requirement of the theory.


2.4.3 Conditional Independence 


An important special case of Proposition 2.15 arises when $E$ and $G$ are such
that $P(E \mid G) = P(E)$, so that beliefs about $E$ are unchanged by the assumed
occurrence of $G$. Not surprisingly, this is directly related to our earlier
operational definition of (pairwise) independence.

Proposition 2.20. For all $F > \emptyset$, $E \perp F \iff P(E \mid F) = P(E)$.

Proof. $E \perp F \iff P(E \cap F) = P(E)P(F)$ and, by Proposition 2.15, we
have $P(E \cap F) = P(E \mid F)P(F)$. $\triangleleft$


In the case of three events, $E$, $F$ and $G$, the situation is somewhat more
complicated in that, from an intuitive point of view, we would regard our degree
of belief for $E$ as being “independent” of knowledge of $F$ and $G$ if and only if
$P(E \mid H) = P(E)$, for any of the four possible forms of $H$,

$$\{F \cap G, \; F^c \cap G, \; F \cap G^c, \; F^c \cap G^c\},$$

describing the combined occurrences, or otherwise, of $F$ and $G$ (and, of course,
similar conditions must hold for the “independence” of $F$ from $E$ and $G$, and of
$G$ from $E$ and $F$). These considerations motivate the following formal definition,
which generalises Definition 2.6 and can be shown (see e.g. Feller, 1950/1968,
pp. 125-128) to be necessary and sufficient for encapsulating, in the general case,
the intuitive conditions discussed above.


Definition 2.12. (Mutual independence).
Events $\{E_j, j \in J\}$ are said to be mutually independent if, for any $I \subseteq J$,

$$P\Big(\bigcap_{i \in I} E_i\Big) = \prod_{i \in I} P(E_i).$$
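
To see that the product condition must be checked for every subcollection, and not just for pairs, consider the following sketch. It assumes a hypothetical setting of our own choosing (two tosses of a coin with four outcomes judged equally likely); the events `E`, `F`, `G` and the helper names are illustrative and do not come from the text.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite setting: four outcomes judged equally likely.
prob = {outcome: Fraction(1, 4) for outcome in ("HH", "HT", "TH", "TT")}


def P(event):
    """Probability of an event, represented as a set of outcomes."""
    return sum(prob[w] for w in event)


def mutually_independent(events):
    """Check the product rule for every subcollection of two or more events."""
    for size in range(2, len(events) + 1):
        for subset in combinations(events, size):
            joint = set(prob)      # start from the certain event
            product = Fraction(1)
            for e in subset:
                joint &= e         # intersection of the events in the subset
                product *= P(e)
            if P(joint) != product:
                return False
    return True


E = {"HH", "HT"}  # first toss is a head
F = {"HH", "TH"}  # second toss is a head
G = {"HH", "TT"}  # the two tosses agree

print(mutually_independent([E, F]))
print(mutually_independent([E, F, G]))
```

Each pair among `E`, `F`, `G` satisfies the product rule, but the triple does not, so the three events are pairwise but not mutually independent.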


An important consequence of the fact that coherent degrees of belief combine
in conformity with the rules of (finitely additive) mathematical probability theory
is that the task of specifying degrees of belief for complex combinations of events
is often greatly simplified. Instead of being forced into a direct specification, we
can attempt to represent the complex event in terms of simpler events, for which
we feel more comfortable in specifying degrees of belief. The latter are then
recombined, using the probability rules, to obtain the desired specification for the
complex event. Definition 2.12 makes clear that the judgement of independence for
a collection of events leads to considerable additional simplification when complex
intersections of events are to be considered. Note that Proposition 2.20 derives from
the uncertainty relation $\le_F$ and therefore reflects an inherently personal judgement
(although coherence may rule out some events from being judged independent: for
example, any $E$, $F$ such that $\emptyset \subset E \subset F \subset \Omega$).

There is a sense, however, in which the judgement of independence (given
$M_0$) for large classes of events of interest reflects a rather extreme form of belief,
in that scope for learning from experience is very much reduced. This motivates
consideration of the following weaker form of independence judgement.

Definition 2.13. (Conditional independence). The events $\{E_j, j \in J\}$ are
said to be conditionally independent given $G > \emptyset$ if, for any $I \subseteq J$,

$$P\Big(\bigcap_{i \in I} E_i \,\Big|\, G\Big) = \prod_{i \in I} P(E_i \mid G).$$

For any subalgebra $\mathcal{F}$ of $\mathcal{E}$, the events $\{E_j, j \in J\}$ are
said to be conditionally independent given $\mathcal{F}$ if and only if they are
conditionally independent given any $G > \emptyset$ in $\mathcal{F}$.


Definitions 2.12 and 2.13 could, of course, have been stated in primitive terms 
of choices among options, as in Definition 2.6. However, having seen in detail 
the way in which the latter leads to the standard “product definition”, it will be 
clear that a similar equivalence holds in these more general cases, but that the 
algebraic manipulations involved are somewhat more tedious. 


The form of degree of belief judgement encapsulated in Definition 2.13 is one 
which is utilised in some way or another in a wide variety of practical contexts 
and statements of scientific theories. Indeed, a detailed discussion of the kinds of 
circumstances in which it may be reasonable to structure beliefs on the basis of 
such judgements will be a main topic of Chapter 4. Thus, for example, in the prac- 
tical context of sampling, with or without replacement, from large dichotomised 
populations (of voters, manufactured items, or whatever), successive outcomes 
(voting intention, marketable quality, ...) may very often be judged independent, 
given exact knowledge of the proportional split in the dichotomised population. 
Similarly, in simple Mendelian theory, the genotypes of successive offspring are 
typically judged to be independent events, given the knowledge of the two genotypes 
forming the mating. In the absence of such knowledge, however, in neither case 
would the judgement of independence for successive outcomes be intuitively plau- 
sible, since earlier outcomes provide information about the unknown population 
or mating composition and this, in turn, influences judgements about subsequent 
outcomes. For a detailed analysis of the concept of conditional independence, see 
Dawid (1979a, 1979b, 1980b). 
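
The sampling illustration can be made concrete with a small numerical sketch. The two candidate values for the population proportion and their weights are hypothetical, chosen only for illustration, and the function name `prob_all_positive` is ours. Given the proportion, successive outcomes are treated as conditionally independent; in the absence of that knowledge, the first outcome changes the degree of belief attached to the second.

```python
from fractions import Fraction

# Hypothetical beliefs about the unknown proportion theta of "positive" units:
# theta is either 1/5 or 4/5, each judged equally likely (illustrative values).
beliefs = {Fraction(1, 5): Fraction(1, 2), Fraction(4, 5): Fraction(1, 2)}


def prob_all_positive(n, beliefs):
    """P(first n outcomes are all positive), averaging over theta.

    Given theta the outcomes are conditionally independent, so
    P(n positives | theta) = theta ** n.
    """
    return sum(weight * theta ** n for theta, weight in beliefs.items())


p_first = prob_all_positive(1, beliefs)       # P(E1)
p_both = prob_all_positive(2, beliefs)        # P(E1 and E2)
p_second_given_first = p_both / p_first       # P(E2 | E1)

print("P(E1) =", p_first)
print("P(E2 | E1) =", p_second_given_first)
print("independent without theta?", p_both == p_first * p_first)
```

Under these assumed numbers the degree of belief in a second positive outcome rises from 1/2 to 17/25 once a first positive outcome has been observed, so the two outcomes are not judged independent unless the proportion is known.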


2.4.4 Sequential Revision of Beliefs 


Bayes’ theorem characterises the way in which current beliefs about a set of mutually
exclusive and exhaustive hypotheses, $H_j$, $j \in J$, are revised in the light of
new data, $D$. In practice, of course, we typically receive data in successive stages,
so that the process of revising beliefs is sequential.

As a simple illustration of this process, let us suppose that data are obtained
in two stages, which can be described by real-world events $D_1$ and $D_2$. Omitting,
for convenience, explicit conditioning on $M_0$, revision of beliefs on the basis of the
first piece of data $D_1$ is described by $P(H_j \mid D_1) = P(D_1 \mid H_j) P(H_j) / P(D_1)$,
$j \in J$. When it comes to the further, subsequent revision of beliefs in the light of
$D_2$, the likelihoods and prior probabilities to be used in Bayes’ theorem are now
$P(D_2 \mid H_j \cap D_1)$ and $P(H_j \mid D_1)$, $j \in J$, respectively, since all judgements are
now conditional on $D_1$. We thus have, for all $j \in J$,

$$P(H_j \mid D_1 \cap D_2) = \frac{P(D_2 \mid H_j \cap D_1)\, P(H_j \mid D_1)}{P(D_2 \mid D_1)},$$

where $P(D_2 \mid D_1) = \sum_{j} P(D_2 \mid H_j \cap D_1)\, P(H_j \mid D_1)$.

From an intuitive standpoint, we would obviously anticipate that coherent
revision of initial belief in the light of the combined data, $D_1 \cap D_2$, should not
depend on whether $D_1$, $D_2$ were analysed successively or in combination. This is
easily verified by substituting the expression for $P(H_j \mid D_1)$ into the expression for
$P(H_j \mid D_1 \cap D_2)$, whereupon we obtain

$$P(H_j \mid D_1 \cap D_2) = \frac{P(D_2 \mid H_j \cap D_1)\, P(D_1 \mid H_j)\, P(H_j)}{P(D_2 \mid D_1)\, P(D_1)} = \frac{P(D_1 \cap D_2 \mid H_j)\, P(H_j)}{P(D_1 \cap D_2)},$$

the latter being the direct expression for $P(H_j \mid D_1 \cap D_2)$ from Bayes’ theorem
when $D_1 \cap D_2$ is treated as a single piece of data.

The generalisation of this sequential revision process to any number of stages,
corresponding to data $D_1, D_2, \ldots, D_k, \ldots$, proceeds straightforwardly. If we
write $D^{(k)} = D_1 \cap D_2 \cap \cdots \cap D_k$ to denote all the data received up to and including
stage $k$, then, for all $j \in J$,

$$P(H_j \mid D^{(k+1)}) = \frac{P(D_{k+1} \mid H_j \cap D^{(k)})\, P(H_j \mid D^{(k)})}{P(D_{k+1} \mid D^{(k)})},$$

which provides a recursive algorithm for the revision of beliefs.
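
As a minimal sketch of the recursion, the following code revises beliefs over three hypotheses in two stages and compares the result with a single revision based on the combined data. All probabilities are hypothetical values chosen for illustration, and `bayes_update` is our own helper name.

```python
from fractions import Fraction


def bayes_update(prior, likelihood):
    """One stage of revision: the posterior is proportional to likelihood * prior."""
    joint = [lik * p for lik, p in zip(likelihood, prior)]
    total = sum(joint)  # plays the role of P(D_{k+1} | D^(k))
    return [x / total for x in joint]


# Hypothetical inputs for hypotheses H_1, H_2, H_3.
prior = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
lik_d1 = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]           # P(D1 | Hj)
lik_d2_given_d1 = [Fraction(2, 5), Fraction(1, 5), Fraction(7, 10)]   # P(D2 | Hj and D1)

# Sequential revision: first D1, then D2 with all judgements conditional on D1.
after_d1 = bayes_update(prior, lik_d1)
sequential = bayes_update(after_d1, lik_d2_given_d1)

# Single revision with D1 and D2 treated as one piece of data:
# P(D1 and D2 | Hj) = P(D2 | Hj and D1) * P(D1 | Hj).
lik_combined = [a * b for a, b in zip(lik_d2_given_d1, lik_d1)]
combined = bayes_update(prior, lik_combined)

print(sequential == combined)
```

Exact rational arithmetic is used so that the two routes can be compared with `==` rather than with a numerical tolerance.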

There is, however, a potential practical difficulty in implementing this process,
since there is an implicit need to specify the successively conditioned likelihoods,
$P(D_{k+1} \mid H_j \cap D^{(k)})$, $j \in J$, a task which, in the absence of simplifying
assumptions, may appear to be impossibly complex if $k$ is at all large. One possible
form of simplifying assumption is the judgement of conditional independence for
$D_1, D_2, \ldots, D_k$, given any $H_j$, $j \in J$, since, by Definition 2.13, we then only need
the evaluations $P(D_{k+1} \mid H_j \cap D^{(k)}) = P(D_{k+1} \mid H_j)$, $j \in J$. Another possibility
might be to assume a rather weak form of dependence by making the judgement
that a (Markov) property such as $P(D_{k+1} \mid H_j \cap D^{(k)}) = P(D_{k+1} \mid H_j \cap D_k)$,
$j \in J$, holds for all $k$. As we shall see later, these kinds of simplifying structural
assumptions play a fundamental role in statistical modelling and analysis.

In the case of two hypotheses, $H_1$, $H_2$, the judgement of conditional independence
for $D_1, D_2, \ldots, D_k, \ldots$, given $H_1$ or $H_2$, enables us to provide an alternative
description of the process of revising beliefs by noting that, in this case,

$$\frac{P(H_1 \mid D^{(k+1)})}{P(H_2 \mid D^{(k+1)})} = \frac{P(H_1 \mid D^{(k)})}{P(H_2 \mid D^{(k)})} \times \frac{P(D_{k+1} \mid H_1)}{P(D_{k+1} \mid H_2)}.$$


With due regard to the relative nature of the terms prior and posterior, we can
thus summarise the learning process (in “favour” of $H_1$) as follows:


posterior odds = prior odds × likelihood ratio.
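
The odds form lends itself to a very short sketch. The prior odds and the sequence of likelihood ratios below are hypothetical, and the data are assumed conditionally independent given either hypothesis, so that each stage contributes one multiplicative factor.

```python
from fractions import Fraction


def update_odds(prior_odds, likelihood_ratios):
    """Return the odds on H1 against H2 after each successive piece of data."""
    odds = prior_odds
    history = [odds]
    for ratio in likelihood_ratios:
        odds = odds * ratio  # posterior odds = prior odds * likelihood ratio
        history.append(odds)
    return history


def odds_to_probability(odds):
    """Convert odds on H1 against H2 into P(H1), given exactly two hypotheses."""
    return odds / (1 + odds)


prior_odds = Fraction(1, 4)                            # i.e. P(H1) = 1/5
ratios = [Fraction(3), Fraction(3), Fraction(1, 2)]    # P(D_k | H1) / P(D_k | H2)

history = update_odds(prior_odds, ratios)
posterior_h1 = odds_to_probability(history[-1])

print(history)
print(posterior_h1)
```

Each posterior odds value serves as the prior odds for the next stage, which is the sense in which the terms prior and posterior are relative.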


In Section 2.6, we shall examine in more detail the key role played by the 
sequential revision of beliefs in the context of complex, sequential decision prob- 
lems. 


2.5 ACTIONS AND UTILITIES 
2.5.1 Bounded Sets of Consequences 


At the beginning of Section 2.4, we argued that choices among options are governed,
in part, by the relative degrees of belief that an individual attaches to the uncertain 
events involved in the options. It is equally clear that choices among options should 
depend on the relative values that an individual attaches to the consequences flowing 
from the events. The measurement framework of Axiom 5(i) provides us with a 
direct, intuitive way of introducing a numerical measure of value for consequences, 
in such a way that the latter has a coherent, operational basis. Before we do this, 
we need to consider a little more closely the nature of the set of consequences C. 
The following special case provides a useful starting point for our development of 
a measure of value for consequences. 


Definition 2.14. (Extreme consequences). The pair of consequences $c_*$ and
$c^*$ are called, respectively, the worst and the best consequences in a decision
problem if, for any other consequence $c \in \mathcal{C}$, $c_* \leq c \leq c^*$.


It could be argued that all real decision problems actually have extreme consequences.
Indeed, we recall that all consequences are to be thought of as relevant
consequences in the context of the decision problem. This eliminates pathological,
mathematically motivated choices of $\mathcal{C}$, which could be constructed in such a way
as to rule out the existence of extreme consequences. For example, in mathematical
modelling of decision problems involving monetary consequences, $\mathcal{C}$ is often taken
to be the real line $\Re$ or, in a no-loss situation with current assets $k$, to be the interval
$[k, \infty)$. Such $\mathcal{C}$’s would not contain both a best and a worst consequence but, on the
other hand, they clearly do not correspond to concrete, practical problems. In the
next section, we shall consider the solution to decision problems for which extreme
consequences are assumed to exist.

Nevertheless, despite the force of the pragmatic argument that extreme conse- 
quences always exist, it must be admitted that insisting upon problem formulations 
which satisfy the assumption of the existence of extreme consequences can some- 
times lead to rather tedious complications of a conceptual or mathematical nature. 

Consider, for example, a medical decision problem for which the consequences
take the form of different numbers of years of remaining life for a patient. Assuming
that more value is attached to longer survival, it would appear rather difficult to
justify any particular choice of realistic upper bound, even though we believe there
to be one. To choose a particular $c^*$ would be tantamount to putting forward $c^*$
years as a realistic possible survival time, but regarding $c^* + 1$ years as impossible!
In such cases, it is attractive to have available the possibility, for conceptual and
mathematical convenience, of dealing with sets of consequences not possessing
extreme elements (and the same is true of many problems involving monetary
consequences). For this reason, we shall also deal (in Section 2.5.3) with the
situation in which extreme consequences are not assumed to exist.


2.5.2 Bounded Decision Problems 


Let us consider a decision problem $(\mathcal{E}, \mathcal{C}, \mathcal{A}, \leq)$ for which extreme consequences
$c_* < c^*$ are assumed to exist. We shall refer to such decision problems as bounded.


Definition 2.15. (Canonical utility function for consequences). Given a
preference relation $\leq$, the utility $u(c) = u(c \mid c_*, c^*)$ of a consequence $c$,
relative to the extreme consequences $c_* < c^*$, is the real number $\mu(S)$ associated
with any standard event $S$ such that $c \sim \{c^* \mid S, c_* \mid S^c\}$. The mapping
$u : \mathcal{C} \to \Re$ is called the utility function.


It is important to note that the definition of utility only involves comparison
among consequences and options constructed with standard events. Since the
preference pattern among consequences is unaffected by additional information,
we would expect the utility of a consequence to be uniquely defined and to remain
unchanged as new information is obtained. This is indeed the case.


Proposition 2.21. (Existence and uniqueness of bounded utilities). For any
bounded decision problem $(\mathcal{E}, \mathcal{C}, \mathcal{A}, \leq)$ with extreme consequences $c_* < c^*$,

(i) for all $c$, $u(c \mid c_*, c^*)$ exists and is unique;

(ii) the value of $u(c \mid c_*, c^*)$ is unaffected by the assumed occurrence of an
event $G > \emptyset$;

(iii) $0 = u(c_* \mid c_*, c^*) \leq u(c \mid c_*, c^*) \leq u(c^* \mid c_*, c^*) = 1$.

Proof. (i) Existence follows immediately from Axiom 5(i). For uniqueness,
note that if $c \sim \{c^* \mid S_1, c_* \mid S_1^c\}$ and $c \sim \{c^* \mid S_2, c_* \mid S_2^c\}$ then, by transitivity
and Axiom 3(ii), $\{c^* \mid S_1, c_* \mid S_1^c\} \sim \{c^* \mid S_2, c_* \mid S_2^c\}$ and $S_1 \sim S_2$; the result now
follows from Axiom 4(i).

(ii) To establish this, let $c \sim \{c^* \mid S_1, c_* \mid S_1^c\}$, so that $u(c \mid c_*, c^*) = \mu(S_1)$;
using Axiom 4(iii), for any $G > \emptyset$ choose $S_2$ such that $G \perp S_2$ and $\mu(S_2) = \mu(S_1)$.
Then, by Definition 2.6, $c \sim_G \{c^* \mid S_2, c_* \mid S_2^c\}$ and so the utility of $c$ given $G$ is
just the original value $\mu(S_2)$.

(iii) Finally, since $c^* = \{c^* \mid \Omega, c_* \mid \emptyset\}$, $c_* = \{c^* \mid \emptyset, c_* \mid \Omega\}$, and both $\emptyset$ and $\Omega$
belong to the algebra of standard events, we have $u(c_* \mid c_*, c^*) = \mu(\emptyset) = 0$ and
$u(c^* \mid c_*, c^*) = \mu(\Omega) = 1$. It then follows, from Definition 2.15 and Axiom 4(i),
that $0 \leq u(c \mid c_*, c^*) \leq 1$. $\triangleleft$

It is interesting to note that $u(c \mid c_*, c^*)$, which we shall often simply denote
by $u(c)$, can be given an operational interpretation in terms of degrees of belief.
Indeed, if we consider a choice between the fixed consequence $c$ and the option
$\{c^* \mid E, c_* \mid E^c\}$, for some event $E$, then the utility of $c$ can be thought of as defining
a threshold value for the degree of belief in $E$, in the sense that values greater than
$u$ would lead an individual to prefer the uncertain option, whereas values less than
$u$ would lead the individual to prefer $c$ for certain. The value $u$ itself corresponds to
indifference between the two options and is the degree of belief in the occurrence
of the best, rather than worst, consequence.

This suggests one possible technique for the experimental elicitation of utilities, a
subject which has generated a large literature (with contributions from economists
and psychologists, as well as from statisticians). We shall illustrate the ideas in
Example 2.3.

Using the coherence and quantification principles set out in Section 2.3, we 
have seen how numerical measures can be assigned to two of the elements of 
a decision problem in the form of degrees of belief for events and utilities for 
consequences. It remains now to investigate how an overall numerical measure of 
value can be attached to an option, whose form depends both on the events of a
finite partition of the certain event $\Omega$ and on the particular consequences to which
these events lead.


Definition 2.16. (Conditional expected utility).
For any $c_* < c^*$, $G > \emptyset$, and $a = \{c_j \mid E_j, j \in J\}$,

$$\bar{u}(a \mid c_*, c^*, G) = \sum_{j \in J} u(c_j \mid c_*, c^*)\, P(E_j \mid G)$$

is the expected utility of the option $a$, given $G$, with respect to the extreme
consequences $c_*$, $c^*$. If $G = \Omega$, we shall simply write $\bar{u}(a \mid c_*, c^*)$ in place of
$\bar{u}(a \mid c_*, c^*, \Omega)$.

In the language of mathematical probability theory (see Chapter 3), if the utility
value of $a$ is considered as a “random quantity”, contingent on the occurrence of
a particular event $E_j$, then $\bar{u}$ is simply the expected value of that utility when the
probabilities of the events are considered conditional on $G$.


Proposition 2.22. (Decision criterion for a bounded decision problem).
For any bounded decision problem with extreme consequences $c_* < c^*$ and $G > \emptyset$,

$$a_1 \leq_G a_2 \iff \bar{u}(a_1 \mid c_*, c^*, G) \leq \bar{u}(a_2 \mid c_*, c^*, G).$$


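Proof. Let $a_i = \{c_{ij} \mid E_{ij}, j = 1, \ldots, n_i\}$, $i = 1, 2$. By Axioms 5(ii), 4(iii),
and Proposition 2.13, for all $(i, j)$ there exist $S_{ij}$ and $S'_{ij}$ such that

$$c_{ij} \sim \{c^* \mid S'_{ij}, c_* \mid S'^{\,c}_{ij}\}, \qquad S_{ij} \perp (E_{ij} \cap G), \qquad P(S_{ij}) = P(S'_{ij}).$$

Hence, by Proposition 2.10, $c_{ij} \sim \{c^* \mid S_{ij}, c_* \mid S_{ij}^c\}$ with $S_{ij} \perp (E_{ij} \cap G)$ and
$P(S_{ij}) = u(c_{ij} \mid c_*, c^*)$. By Definition 2.6, for $i = 1, 2$ and any option $a$,

$$\{c_{ij} \mid E_{ij} \cap G, j = 1, \ldots, n_i, \; a \mid G^c\} \sim \{(c^* \mid S_{ij}, c_* \mid S_{ij}^c) \mid E_{ij} \cap G, j = 1, \ldots, n_i, \; a \mid G^c\},$$

which may be written as $\{c^* \mid A_i, c_* \mid B_i, a \mid G^c\}$, where $A_i = \bigcup_j (E_{ij} \cap G \cap S_{ij})$
and $B_i = \bigcup_j (E_{ij} \cap G \cap S_{ij}^c)$. By Propositions 2.14(ii) and 2.16, and using
Definition 2.5, $a_1 \leq_G a_2 \iff A_1 \leq_G A_2 \iff P(A_1 \mid G) \leq P(A_2 \mid G)$. But, by
Proposition 2.15, $P(E_{ij} \cap G \cap S_{ij}) = P(E_{ij} \cap G) P(S_{ij}) = P(S_{ij}) P(E_{ij} \mid G) P(G)$.
Hence,

$$P(A_i \mid G) = \sum_{j=1}^{n_i} u(c_{ij} \mid c_*, c^*)\, P(E_{ij} \mid G) = \bar{u}(a_i \mid c_*, c^*, G)$$

and so $a_1 \leq_G a_2 \iff \bar{u}(a_1 \mid c_*, c^*, G) \leq \bar{u}(a_2 \mid c_*, c^*, G)$. $\triangleleft$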


The result just established is sometimes referred to as the principle of max- 
imising expected utility. In our development, this is clearly not an independent 
“principle”, but rather an implication of our assumptions and definitions. In sum- 
mary form, the resulting prescription for quantitative, coherent decision-making is: 
choose the option with the greatest expected utility. 
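
The prescription can be sketched in a few lines. The options, events, degrees of belief and canonical utilities below are hypothetical, invented purely to illustrate the computation of Definition 2.16 followed by the choice of the option with greatest expected utility.

```python
from fractions import Fraction


def expected_utility(option, utility, belief):
    """Sum of u(c_j) * P(E_j | G) over the partition defining the option.

    option  maps each event to the consequence it leads to,
    utility maps each consequence to its utility,
    belief  maps each event to its degree of belief given G.
    """
    return sum(utility[c] * belief[event] for event, c in option.items())


# Hypothetical canonical utilities: worst consequence 0, best consequence 1.
utility = {
    "soaked": Fraction(0),
    "dry, carrying umbrella": Fraction(4, 5),
    "dry, hands free": Fraction(1),
}
belief = {"rain": Fraction(3, 10), "no rain": Fraction(7, 10)}
options = {
    "take umbrella": {"rain": "dry, carrying umbrella",
                      "no rain": "dry, carrying umbrella"},
    "leave umbrella": {"rain": "soaked", "no rain": "dry, hands free"},
}

expected = {name: expected_utility(opt, utility, belief)
            for name, opt in options.items()}
best = max(expected, key=expected.get)

print(expected)
print(best)
```

With a finite set of options, as here, a best option always exists, although it need not be unique; `max` simply returns the first one it encounters.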

Technically, of course, Proposition 2.22 merely establishes, for each $\leq_G$, a
complete ordering of the options considered and does not guarantee the existence
of an optimal option for which the expected utility is a maximum. However, in
most (if not all) concrete, practical problems the set of options considered will be
finite and so a best option (not necessarily unique) will exist. In more abstract
mathematical formulations, the existence of a maximum will depend on analytic
features of the set of options and on the utility function $u : \mathcal{C} \to \Re$.

Example 2.3. (Utilities of oil wildcatters). One of the earliest reported systematic
attempts at the quantification of utilities in a practical decision-making context was that of 
Grayson (1960), whose decision-makers were oil wildcatters engaged in exploratory searches 
for oil and gas. The consequences of drilling decisions and their outcomes are ultimately 
changes in the wildcatters’ monetary assets, and Grayson’s work focuses on the assessment 
of utility functions for this latter quantity. 

For the purposes of illustration, suppose that we restrict attention to changes in monetary
assets ranging, in units of one thousand dollars, from $-150$ (the worst consequence) to $+825$
(the best consequence). Assuming $u(-150) = 0$, $u(825) = 1$, the above development
suggests ways in which we might try to elicit an individual wildcatter’s values of $u(c)$ for
various $c$ in the range $-150 < c < 825$. For example, one could ask the wildcatter, using a
series of values of $c$, which option he or she would prefer out of the following:


(i) $c$ for sure,
(ii) entry into a venture having outcome 825 with probability $p$ and an outcome $-150$ with
probability $1 - p$, for some specified $p$.


If $c_p$ emerges from such interrogation as an approximate “indifference” value, the theory
developed above suggests that, for a coherent individual,


$$u(c_p) = p\, u(825) + (1 - p)\, u(-150) = p.$$


Repeating this exercise for a range of values of $p$ provides a series of $(c_p, p)$ pairs, from
which a “picture” of $u(c)$ over the range of interest can be obtained. An alternative procedure,
of course, would be to fix $c$, perform an interrogation for various $p$ until an “indifference”
value, $p_c$, is found, and then repeat this procedure for a range of values of $c$ to obtain a series
of $(c, p_c)$ pairs.
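
The second procedure (fix a sure amount, then vary the probability until indifference is reached) can be sketched as a bisection search. The respondent is simulated here by a hidden utility curve that is entirely hypothetical; it is not Grayson's data, and in a real interrogation the function `prefers_gamble` would be replaced by the individual's answers.

```python
import math

WORST, BEST = -150.0, 825.0  # range used in the example, in thousands of dollars


def hidden_utility(c):
    """Stand-in for the respondent's unobserved utility (purely hypothetical)."""
    scale = math.log1p((BEST - WORST) / 100.0)
    return math.log1p((c - WORST) / 100.0) / scale


def prefers_gamble(c, p):
    """Simulated answer: is the venture (825 w.p. p, -150 otherwise) preferred to c for sure?"""
    return p > hidden_utility(c)


def elicit_indifference_p(c, ask=prefers_gamble, tol=1e-6):
    """Bisect on p until the answers bracket the indifference value p_c within tol."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        p = (lo + hi) / 2
        if ask(c, p):
            hi = p   # venture preferred, so p_c lies below p
        else:
            lo = p   # sure amount preferred (or indifferent), so p_c is at least p
    return (lo + hi) / 2


pairs = [(c, elicit_indifference_p(c)) for c in (-100, 0, 200, 500, 800)]
for c, p_c in pairs:
    print(f"c = {c:5d}   p_c = {p_c:.3f}")
```

Each elicited value estimates the utility of the corresponding sure amount on the canonical scale, so the resulting pairs trace out the utility curve over the range of interest.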


[Figure 2.3. William Beard's utility function for changes in monetary assets. Horizontal
axis: thousands of dollars (−200 to 800); vertical axis: utility.]

Figure 2.3 shows the results obtained by Grayson using procedures of this kind to
interrogate oil company executive, W. Beard, on October 23, 1957. A “picture” of Beard’s
utility function clearly emerges from the empirical data. In particular, over the range
concerned, the utility function reflects considerable risk aversion, in the sense that even quite
small asset losses lead to large (negative) changes in utility compared with the (positive)
changes associated with asset gains.


Since the expected utility $\bar{u}$ is a linear combination of values of the utility
function, Proposition 2.22 guarantees that preferences among options are invariant
under changes in the origin and scale of the utility measure used; i.e., invariant
with respect to transformations of the form $Au(\cdot) + B$, provided we take $A > 0$,
so that the orientation of “best” and “worst” is not changed. In general, therefore,
such an origin and scale can be chosen for convenience in any given problem,
and we can simply refer to the expected utility of an option without needing to
specify the (positive linear) transformation of the utility function which has been
used. However, there may be bounded decision problems where the probabilistic
interpretation discussed above makes it desirable to work in terms of canonical
utilities, derived by referring to the best and worst consequences.
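
The invariance can be checked in one line. As a sketch, write $u'(\cdot) = Au(\cdot) + B$ with $A > 0$ (the symbol $u'$ is introduced here only for illustration) and use the fact that the events $E_j$ defining an option form a partition, so that their probabilities given $G$ sum to one:

```latex
\bar{u}'(a \mid G)
  = \sum_{j \in J} \bigl[ A\,u(c_j) + B \bigr]\, P(E_j \mid G)
  = A \sum_{j \in J} u(c_j)\, P(E_j \mid G) + B \sum_{j \in J} P(E_j \mid G)
  = A\,\bar{u}(a \mid G) + B .
```

Since $A > 0$, we have $\bar{u}'(a_1 \mid G) \leq \bar{u}'(a_2 \mid G)$ if and only if $\bar{u}(a_1 \mid G) \leq \bar{u}(a_2 \mid G)$, so the ordering of options is unchanged.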

In the next section, we shall provide an extension of these ideas to more general 
decision problems where extreme consequences are not assumed to exist. 


2.5.3 General Decision Problems 


We begin with a more general definition of the utility of a consequence which 
preserves the linear combination structure and the invariance discussed above. 


Definition 2.17. (General utility function). Given a preference relation $\leq$,
the utility $u(c \mid c_1, c_2)$ of a consequence $c$, relative to the consequences $c_1 < c_2$,
is defined to be the real number $u$ such that


if $c < c_1$ and $c_1 \sim \{c_2 \mid S_x, c \mid S_x^c\}$, then $u = -x/(1 - x)$;
if $c_1 \leq c \leq c_2$ and $c \sim \{c_2 \mid S_x, c_1 \mid S_x^c\}$, then $u = x$;
if $c > c_2$ and $c_2 \sim \{c \mid S_x, c_1 \mid S_x^c\}$, then $u = 1/x$;


where $x = \mu(S_x)$ is the measure associated with the standard event $S_x$.


Our restricted definition of utility (Definition 2.15) relied on the existence
of extreme consequences $c_*$, $c^*$, such that $c_* \leq c \leq c^*$ for all $c \in \mathcal{C}$. In the
absence of this assumption, we have to select some reference consequences, $c_1$, $c_2$,
to play the role of $c_*$, $c^*$. However, we cannot then assume that $c_1 \leq c \leq c_2$ for
all $c$, and this means that if $c_1$, $c_2$ are to define a utility scale by being assigned
values 0, 1, respectively, we shall require negative assignments for $c < c_1$ and
assignments greater than one for $c > c_2$. The definition is motivated by a desire
to maintain the linear features of the utility function obtained in the case where
extreme consequences exist. It can be checked straightforwardly that if $c_{(1)}$, $c_{(2)}$,
$c_{(3)}$ denote any permutation of $c$, $c_1$, $c_2$, where $c_1 < c_2$ and $c_{(1)} \leq c_{(2)} \leq c_{(3)}$, the
definition given ensures that for any $G > \emptyset$, $c_{(2)} \sim_G \{c_{(3)} \mid S_x, c_{(1)} \mid S_x^c\}$ implies
that

$$u(c_{(2)} \mid c_1, c_2) = x\, u(c_{(3)} \mid c_1, c_2) + (1 - x)\, u(c_{(1)} \mid c_1, c_2).$$
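
The three cases of the general definition, and the linear relation just stated, can be checked numerically. In this sketch the labels `"below"`, `"between"` and `"above"` and the function name `general_utility` are our own; the value of $x$ is an arbitrary illustrative choice.

```python
from fractions import Fraction


def general_utility(position, x):
    """Utility of c relative to c1 < c2, following the three cases of Definition 2.17.

    position: "below" if c < c1, "between" if c1 <= c <= c2, "above" if c > c2.
    x: measure of the standard event S_x appearing in the relevant indifference.
    """
    if position == "below":
        return -x / (1 - x)
    if position == "between":
        return x
    if position == "above":
        return 1 / x
    raise ValueError(f"unknown position: {position}")


x = Fraction(1, 4)   # arbitrary illustrative value
u_c1, u_c2 = Fraction(0), Fraction(1)

# c < c1 with c1 ~ {c2 | S_x, c | S_x^c}: c1 is the middle consequence.
u_below = general_utility("below", x)
check_below = x * u_c2 + (1 - x) * u_below      # should reproduce u(c1) = 0

# c > c2 with c2 ~ {c | S_x, c1 | S_x^c}: c2 is the middle consequence.
u_above = general_utility("above", x)
check_above = x * u_above + (1 - x) * u_c1      # should reproduce u(c2) = 1

print(u_below, check_below)
print(u_above, check_above)
```

In both cases the middle consequence has utility equal to the $x$-weighted combination of the utilities of the outer two, which is the linear feature the definition is designed to preserve.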
The following result extends Proposition 2.21 to the general utility function 
defined above. 


Proposition 2.23. (Existence and uniqueness of utilities). For any decision
problem, and for any pair of consequences $c_1 < c_2$,


(i) for all $c$, $u(c \mid c_1, c_2)$ exists and is unique;
(ii) the value of $u(c \mid c_1, c_2)$ is unaffected by the occurrence of an event $G > \emptyset$;
(iii) $u(c_1 \mid c_1, c_2) = 0$ and $u(c_2 \mid c_1, c_2) = 1$.


Proof. This is virtually identical to the proof of Proposition 2.21.


The following results guarantee that the utilities of consequences are linearly
transformed if the pair of consequences chosen as a reference is changed.


Proposition 2.24. (Linearity). For all $c_1 < c_2$ and $c_3 < c_4$ there exist $A > 0$
and $B$ such that, for all $c$, $u(c \mid c_1, c_2) = A u(c \mid c_3, c_4) + B$.


Proof. Suppose first that $c_2 > c_1$, $c_4 > c_3$, and $c_3 \leq c \leq c_4$. By Axiom 5(ii),
$c_3 \leq c \leq c_4$ implies that there exists a standard event $S_x$ such that
$c \sim \{c_4 \mid S_x, c_3 \mid S_x^c\}$. Hence, by Proposition 2.22,

$$u(c \mid c_1, c_2) = x\, u(c_4 \mid c_1, c_2) + (1 - x)\, u(c_3 \mid c_1, c_2),$$

where $x = P(S_x)$ and, by Definition 2.17, $u(c \mid c_3, c_4) = x$. Hence, $u(c \mid c_1, c_2) =
A u(c \mid c_3, c_4) + B$, where $A = u(c_4 \mid c_1, c_2) - u(c_3 \mid c_1, c_2)$ and $B = u(c_3 \mid c_1, c_2)$.

By Axiom 5(ii), if $c_3 > c$ there exists $S_y$ such that $c_3 \sim \{c_4 \mid S_y, c \mid S_y^c\}$.
Hence, by Proposition 2.22,

$$u(c_3 \mid c_1, c_2) = y\, u(c_4 \mid c_1, c_2) + (1 - y)\, u(c \mid c_1, c_2),$$

where $y = P(S_y)$ and, by Definition 2.17, $u(c \mid c_3, c_4) = -y/(1 - y)$. Hence,
$u(c \mid c_1, c_2) = A u(c \mid c_3, c_4) + B$, with $A$ and $B$ as above. Similarly, if $c > c_4$ there
exists $S_y$ such that $c_4 \sim \{c \mid S_y, c_3 \mid S_y^c\}$ and

$$u(c_4 \mid c_1, c_2) = y\, u(c \mid c_1, c_2) + (1 - y)\, u(c_3 \mid c_1, c_2),$$

where $y = P(S_y)$ and, by Definition 2.17, $u(c \mid c_3, c_4) = 1/y$. Hence, we have
$u(c \mid c_1, c_2) = A u(c \mid c_3, c_4) + B$, with $A$ and $B$ as above.

Now suppose that the $c$’s have arbitrary order, subject to $c_2 > c_1$, $c_4 > c_3$.
Let $c_*$, $c^*$ be the minimum and maximum, respectively, of $\{c_1, c_2, c_3, c_4, c\}$. Then,
by the above, there exist $A_1$, $B_1$, $A_2$, $B_2$ such that, for $c_{(i)} \in \{c_1, c_2, c_3, c_4, c\}$,
$u(c_{(i)} \mid c_*, c^*) = A_1 u(c_{(i)} \mid c_1, c_2) + B_1$ and $u(c_{(i)} \mid c_*, c^*) = A_2 u(c_{(i)} \mid c_3, c_4) + B_2$;
hence, $u(c_{(i)} \mid c_1, c_2) = (A_2/A_1)\, u(c_{(i)} \mid c_3, c_4) + (B_2 - B_1)/A_1$.

Finally, we generalise Proposition 2.22 to unbounded decision problems.


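Proposition 2.25. (General decision criterion).
For any decision problem, pair of consequences $c_1 < c_2$, and event $G > \emptyset$,

$$a_1 \leq_G a_2 \iff \bar{u}(a_1 \mid c_1, c_2, G) \leq \bar{u}(a_2 \mid c_1, c_2, G).$$

Proof. Suppose $a_i = \{c_{ij} \mid E_{ij}, j = 1, \ldots, n_i\}$, $i = 1, 2$, and let $c_*$, $c^*$ be
such that, for all $c_{ij}$, $c_* \leq c_{ij} \leq c^*$. Then, by Proposition 2.22,
$a_1 \leq_G a_2 \iff \bar{u}(a_1 \mid c_*, c^*, G) \leq \bar{u}(a_2 \mid c_*, c^*, G)$. But, by Proposition 2.24,
there exist $A > 0$ and $B$ such that $u(c \mid c_*, c^*) = A u(c \mid c_1, c_2) + B$, and so the
result follows.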


An immediate implication of Proposition 2.25 is that all options can be compared
among themselves. We recall that we did not directly assume that comparisons
could be made between all pairs of options (an assumption which is often
criticised as unjustified: see, for example, Fine 1973, p. 221). Instead, we merely
assumed that all consequences could be compared among themselves and with the
(very simply structured) standard dichotomised options, and that the latter could
be compared among themselves.

This completes our elaboration of the axiom system set out in Section 2.3. 
Starting from the primitive notion of preference, $\leq$, we have shown that quantitative,
coherent comparisons of options must proceed as if a utility function has been
assigned to consequences, probabilities to events and the choice of an option made 
on the basis of maximising expected utility. 

If we begin by defining a utility function $u : \mathcal{C} \to \Re$, this induces in turn
a preference ordering which is necessarily coherent. Any function can serve as a
utility function (subject only to the existence of the expected utility for each option, 
a problem which does not arise in the case of finite partitions) and the choice is a 
personal one. In some contexts, however, there are further formal considerations 
which may delimit the form of function chosen. An important special case is 
discussed in detail in Section 2.7. 


2.6 SEQUENTIAL DECISION PROBLEMS 
2.6.1 Complex Decision Problems 


Many real decision problems would appear to have a more complex structure than 
that encapsulated in Definition 2.1. For instance, in the fields of market research 
and production engineering investigators often consider first whether or not to 
run a pilot study and only then, in the light of information obtained (or on the 
basis of initial information if the study is not undertaken), are the major options
considered. Such a two-stage process provides a simple example of a sequential
decision problem, involving successive, interdependent decisions. In this section,
we shall demonstrate that complex problems of this kind can be solved with the 
tools already at our disposal, thus substantiating our claim that the principles of 
quantitative coherence suffice to provide a prescriptive solution to any decision 
problem. 

Before explicitly considering sequential problems, we shall review, using a 
more detailed notation, some of our earlier developments. 

Let $\mathcal{A} = \{a_i, i \in I\}$ be the set of alternative actions we are willing to consider.
For each $a_i$, there is a class $\{E_{ij}, j \in J_i\}$ of exhaustive and mutually exclusive
events, which label the possible consequences $\{c_{ij}, j \in J_i\}$ which may result from
action $a_i$. Note that, with this notation, we are merely emphasising the obvious
dependence of both the consequences and the events on the action from which they
result. If $M_0$ is our initial state of information and $G > \emptyset$ is additional information
obtained subsequently, the main result of the previous section (Proposition 2.25)
may be restated as follows.


For behaviour consistent with the principles of quantitative coherence, action
$a_1$ is to be preferred to action $a_2$, given $M_0$ and $G$, if and only if

$$\bar{u}(a_1 \mid G) > \bar{u}(a_2 \mid G),$$

where

$$\bar{u}(a_i \mid G) = \sum_{j \in J_i} u(c_{ij})\, P(E_{ij} \mid a_i, M_0, G),$$

$u(c_{ij})$ is the value attached to the consequence foreseen if action $a_i$ is taken and the
event $E_{ij}$ occurs, and $P(E_{ij} \mid a_i, M_0, G)$ is the degree of belief in the occurrence of
event $E_{ij}$, conditional on action $a_i$ having been taken, and the state of information
being $(M_0, G)$.


We recall that the probability measure used to compute the expected utility is taken
to be a representation of the decision-maker’s degree of belief conditional on the
total information available. By using the extended notation $P(E_{ij} \mid a_i, G, M_0)$,
rather than the more economical $P(E_j \mid G)$ used previously, we are emphasising
that (i) the actual events considered may depend on the particular action envisaged,
(ii) the information available certainly includes the initial information together
with $G > \emptyset$, and (iii) degrees of belief in the occurrence of events such as $E_{ij}$ are
understood to be conditional on action $a_i$ having been assumed to be taken, so
that the possible influence of the decision-maker on the real world is taken into
account.


For any action $a_i$, it is sometimes convenient to describe the relevant events
$E_{ij}$, $j \in J_i$, in a sequential form. For example, in considering the relevant events
which label the consequences of a surgical intervention for cancer, one may first
think of whether the patient will survive the operation and then, conditional on
survival, whether or not the tumour will eventually reappear were this particular
form of surgery to be performed.

These situations are most easily described diagrammatically using decision
trees, such as that shown in Figure 2.4, with as many successive random nodes
as necessary. Obviously, this does not represent any formal departure from our
previous structure, since the problem can be restated with a single random node
where relevant events are defined in terms of appropriate intersections, such as
$E_{ij} \cap F_{ijk}$ in the example shown. It is also usually the case, in practice, that it
is easier to elicit the relevant degrees of belief conditionally, so that, for example,
$P(E_{ij} \cap F_{ijk} \mid a_i, G, M_0)$ would often be best assessed by combining the separately
assessed terms $P(F_{ijk} \mid E_{ij}, a_i, G, M_0)$ and $P(E_{ij} \mid a_i, G, M_0)$.


[Figure 2.4. Conditional description of relevant events]


Conditional analysis of this kind is usually necessary in order to understand the
structure of complicated situations. Consider, for instance, the problem of placing
a bet on the result of a race after which the total amount bet is to be divided up
among those correctly guessing the winner. Clearly, if we bet on the favourite we
have a higher probability of winning, but, if the favourite wins, many people will
have guessed correctly and the prize will be small. It may appear at first sight that
this is a decision problem where the utilities involved in an action (the possible
prizes to be obtained from a bet) depend on the probabilities of the corresponding
uncertain events (the possible winning horses), a possibility not contemplated
in our structure. A closer analysis reveals, however, that the structure of the
problem is similar to that of Figure 2.4. The prize received depends on the bet
you place ($a_i$), the related betting behaviour of other people ($E_{ij}$) and the outcome
of the race ($F_{ijk}$). It is only natural to assume that our degree of belief in the
possible outcomes of the race may be influenced by the betting behaviour of other
people. This conditional analysis straightforwardly resolves the initial, apparent
complication.
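
A small numerical sketch of the betting problem shows how the conditional description works. Every number below is hypothetical: two horses, two possible patterns of betting by other people, degrees of belief in the race outcome given each pattern, and utilities for a winning bet that are smaller when the pool must be shared more widely (a losing bet is given utility 0).

```python
from fractions import Fraction as F

# P(E): degrees of belief about how other people bet (hypothetical).
p_crowd = {"crowd backs favourite": F(7, 10), "crowd split": F(3, 10)}

# P(F | E): degrees of belief about the winner, given the crowd's behaviour.
p_winner_given_crowd = {
    "crowd backs favourite": {"favourite": F(4, 5), "outsider": F(1, 5)},
    "crowd split": {"favourite": F(3, 5), "outsider": F(2, 5)},
}

# Utility of a winning bet, indexed by (bet, crowd behaviour).
utility_if_win = {
    ("favourite", "crowd backs favourite"): F(1, 5),
    ("favourite", "crowd split"): F(2, 5),
    ("outsider", "crowd backs favourite"): F(1),
    ("outsider", "crowd split"): F(1, 2),
}


def expected_utility(bet):
    """Sum utility * P(E and F) over the tree, with P(E and F) = P(F | E) P(E)."""
    total = F(0)
    for crowd, p_e in p_crowd.items():
        for winner, p_f in p_winner_given_crowd[crowd].items():
            joint = p_f * p_e
            utility = utility_if_win[(bet, crowd)] if winner == bet else F(0)
            total += joint * utility
    return total


for bet in ("favourite", "outsider"):
    print(bet, expected_utility(bet))
```

The utilities depend on the bet and on the events, never on the probabilities themselves; the apparent dependence arises only because the crowd's behaviour affects both the prize and our beliefs about the winner.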


We now turn to considering sequences of decision problems. We shall consider 
situations where, after an action has been taken and its consequences observed, a
new decision problem arises, conditional on the new circumstances. For example,
when the consequences of a given medical treatment have been observed, a
physician has to decide whether to continue the same treatment, or to change to an 
alternative treatment, or to declare the patient cured. 

If a decision problem involves a succession of decision nodes, it is intuitively 
obvious that the optimal choice at the first decision node depends on the optimal 
choices at the subsequent decision nodes. In colloquial terms, we typically cannot 
decide what to do today without thinking first of what we might do tomorrow, 
and that, of course, will typically depend on the possible consequences of today’s 
actions. In the next section, we consider a technique, backward induction, which 
makes it possible to solve these problems within the framework we have already 
established. 


2.6.2 Backward Induction 


In any actual decision problem, the number of scenarios which may be contemplated 
at any given time is necessarily finite. Consequently, and bearing in mind that the 
analysis is only strictly valid under certain fixed general assumptions and we cannot 
seriously expect these to remain valid for an indefinitely long period, the number of 
decision nodes to be considered in any given sequential problem will be assumed to 
be finite. Thus, we should be able to define a finite horizon, after which no further 
decisions are envisaged in the particular problem formulation. If, at each node, the 
possibilities are finite in number, the situation may be diagrammatically described 
by means of a decision tree like that of Figure 2.5. 


[Figure 2.5. Decision tree with several decision nodes]


Let $n$ be the number of decision stages considered and let $a^{(m)}$ denote an
action being considered at the $m$th stage. Using the notation for composite options
introduced in Section 2.2, all first-stage actions may be compactly described in the
form

$$a_i^{(1)} = \Big\{ \max_{k \in K_{ij}} a_k^{(2)} \,\Big|\, E_{ij}, \; j \in J_i^{(1)} \Big\},$$

where $\{E_{ij}, j \in J_i^{(1)}\}$ is the partition of relevant events which corresponds to
$a_i^{(1)}$ and the notation “$\max a_k^{(2)}$” refers to the most preferred of the set of options
$\{a_k^{(2)}, k \in K_{ij}\}$ which we would be confronted with were the event $E_{ij}$ to occur.
The “maximisation” is naturally to be understood in the sense of our conditional
preference ordering among the available second-stage options, given the occurrence
of $E_{ij}$. Indeed, the “consequence” of choosing $a_i^{(1)}$ and having $E_{ij}$ occur is that we
are confronted with a set of options $\{a_k^{(2)}, k \in K_{ij}\}$ from which we can choose that
option which is preferred on the basis of our pattern of preferences at that stage.
Similarly, second-stage options may be written in terms of third-stage options, and
the process continued until we reach the $n$th stage, consisting of “ordinary” options
defined in terms of the events and consequences to which they may lead. Formally,
we have

$$a_i^{(m)} = \Big\{ \max_{k \in K_{ij}} a_k^{(m+1)} \,\Big|\, E_{ij}, \; j \in J_i^{(m)} \Big\}, \qquad m = 1, 2, \ldots, n - 1,$$

$$a_i^{(n)} = \big\{ c_{ij} \mid E_{ij}, \; j \in J_i^{(n)} \big\}.$$

It is now apparent that sequential decision problems are a special case of the general
framework which we have developed.

It follows from Proposition 2.25 that, at each stage $m$, if $G_m$ is the relevant
information available, and $u(.)$ is the (generalised) utility function, we may write

$$a_i^{(m)} \leq_{G_m} a_l^{(m)} \iff \bar{u}\{a_i^{(m)} \mid G_m\} \leq \bar{u}\{a_l^{(m)} \mid G_m\},$$

where

$$\bar{u}\{a_i^{(m)} \mid G_m\} = \sum_{j \in J_i} \max_{k \in K_{ij}} \bar{u}\{a_k^{(m+1)} \mid G_{m+1}\}\, P(E_{ij} \mid G_m),$$

$$\bar{u}\{a_i^{(n)} \mid G_n\} = \sum_{j \in J_i} u(c_{ij})\, P(E_{ij} \mid G_n).$$


This means that one has to first solve the final ($n$th) stage, by maximising the
appropriate expected utility; then one has to solve the $(n-1)$th stage by maximising
the expected utility conditional on making the optimal choice at the $n$th stage; and
so on, working backwards progressively, until the optimal first stage option has
been obtained, a procedure often referred to as dynamic programming.

This process of backward induction satisfies the requirement that, at any stage
of the procedure, the $m$th, say, the continuation of the procedure must be identical
to the optimal procedure starting at the $m$th stage with information $G_m$. This
requirement is usually known as Bellman’s optimality principle (Bellman, 1957). As
with the “principle” of maximising expected utility, we see that this is not required 
as a further assumed “principle” in our formulation, but is simply a consequence 
of the principles of quantitative coherence. 
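
The backward recursion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of nodes, the function name `backward_induction` and the numbers in the toy tree are hypothetical and do not come from the text. Chance nodes carry the probabilities of a partition of events, decision nodes carry the available options, and leaves carry utilities.

```python
def backward_induction(node):
    """Return (expected utility, best option label or None) for a tree node.

    Hypothetical node encoding (an assumption of this sketch):
      ("leaf", utility)
      ("chance", [(probability, child), ...])   # a partition of events
      ("decision", {label: child, ...})         # the options available
    """
    kind, payload = node
    if kind == "leaf":
        return payload, None
    if kind == "chance":
        # Expected utility over the partition, each branch being solved optimally.
        return sum(p * backward_induction(child)[0] for p, child in payload), None
    if kind == "decision":
        # Later stages are solved first; the best option is then chosen here.
        values = {label: backward_induction(child)[0] for label, child in payload.items()}
        best = max(values, key=values.get)
        return values[best], best
    raise ValueError(f"unknown node kind: {kind!r}")


# Toy two-stage problem: act now, or observe an event and then choose.
second_stage_good = ("decision", {"a": ("leaf", 1.0), "b": ("leaf", 0.2)})
second_stage_bad = ("decision", {"a": ("leaf", 0.0), "b": ("leaf", 0.6)})
tree = ("decision", {
    "stop": ("leaf", 0.5),
    "test": ("chance", [(0.4, second_stage_good), (0.6, second_stage_bad)]),
})
value, first_choice = backward_induction(tree)
```

The recursion evaluates the deepest decision nodes first, so the value attached to each first-stage option already assumes optimal behaviour at every later stage.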


Example 2.4. (An optimal stopping problem). We now consider a famous problem,
which is usually referred to in the literature as the “marriage problem” or the “secretary
problem”. Suppose that a specified number of objects $n \geq 2$ are to be inspected sequentially,
one at a time, in order to select one of them. Suppose further that, at any stage $r$, $1 \leq r \leq n$,
the inspector has the option of either stopping the inspection process, receiving, as a result,
the object currently under inspection, or of continuing the inspection process with the next
object. No backtracking is permitted and if the inspection process has not terminated before
the $n$th stage the outcome is that the $n$th object is received. At each stage, $r$, the only
information available to the inspector is the relative rank (1 = best, $r$ = worst) of the current
object among those inspected so far, and the knowledge that the $n$ objects are being presented
in a completely random order.

When should the inspection process be terminated? Intuitively, if the inspector stops
too soon there is a good chance that objects more preferred to those seen so far will remain 
uninspected. However, if the inspection process goes on too long there is a good chance that 
the overall preferred object will already have been encountered and passed over. 

This kind of dilemma is inherent in a variety of practical problems, such as property 
purchase in a limited seller's market when a bid is required immediately after inspection,
or staff appointment in a skill shortage area when a job offer is required immediately after 
interview. More exotically—and assuming a rather egocentric inspection process, again 
with no backtracking possibilities—this stopping problem has been suggested as a model 
for choosing a mate. Potential partners are encountered sequentially; the proverb “marry in 
haste, repent at leisure” warns against settling down too soon; but such hesitations have to
be balanced against painful future realisations of missed golden opportunities. 


Less romantically, let $c_i$, $i = 1, \ldots, n$, denote the possible consequences of the
inspection process, with $c_i = i$ if the eventual object chosen has rank $i$ out of all $n$ objects. We
shall denote by $u(c_i) = u(i)$, $i = 1, \ldots, n$, the inspector's utility for these consequences.

Now suppose that $r < n$ objects have been inspected and that the relative rank among
these of the object under current inspection is $x$, where $1 \leq x \leq r$. There are two actions
available at the $r$th stage: $a_1$ = stop, $a_2$ = continue (where, to simplify notation, we have
dropped the superscript, $r$). The information available at the $r$th stage is $G_r = (x, r)$;
the information available at the $(r+1)$th stage would be $G_{r+1} = (y, r+1)$, where $y$,
$1 \leq y \leq r+1$, is the rank of the next object relative to the $r+1$ then inspected, all values
of $y$ being, of course, equally likely since the $n$ objects are inspected in a random order. If
we denote the expected utility of stopping, given $G_r$, by $\bar{u}_s(x, r)$ and the expected utility
of acting optimally, given $G_r$, by $\bar{u}_0(x, r)$, the general development given above establishes
that

$$\bar{u}_0(x, r) = \max\Bigl\{ \bar{u}_s(x, r),\; \frac{1}{r+1} \sum_{y=1}^{r+1} \bar{u}_0(y, r+1) \Bigr\},$$

where

$$\bar{u}_0(x, n) = \bar{u}_s(x, n) = u(x), \qquad x = 1, \ldots, n.$$

Values of $\bar{u}_0(x, r)$ can be found from the final condition and the technique of backwards
induction. The optimal procedure is then seen to be:
(i) continue if $\bar{u}_0(x, r) > \bar{u}_s(x, r)$,
(ii) stop if $\bar{u}_0(x, r) = \bar{u}_s(x, r)$.
For illustration, suppose that the inspector’s preference ordering corresponds to a
“nothing but the best” utility function, defined by $u(1) = 1$, $u(x) = 0$, $x = 2, \ldots, n$. It is then
easy to show that

$$\bar{u}_s(x, r) = \begin{cases} r/n, & x = 1, \\ 0, & x = 2, \ldots, r; \end{cases}$$

thus, if $x > 1$,

$$\bar{u}_0(x, r) > \bar{u}_s(x, r), \qquad r = 1, \ldots, n-1.$$

This implies that inspection should never be terminated if the current object is not the best
seen so far. The decision as to whether to stop if $x = 1$ is determined from the equation

$$\bar{u}_0(1, r) = \max\Bigl\{ \frac{r}{n},\; \frac{r}{n} \Bigl( \frac{1}{n-1} + \frac{1}{n-2} + \cdots + \frac{1}{r} \Bigr) \Bigr\},$$

which is easily verified by induction. If $r^*$ is the smallest positive integer for which

$$\frac{1}{n-1} + \frac{1}{n-2} + \cdots + \frac{1}{r^*} \leq 1,$$

the optimal procedure is defined as follows:
(i) continue until at least $r^*$ objects have been inspected;
(ii) if the $r^*$th object is the best so far, stop;
(iii) otherwise, continue until the object under inspection is the best so far, then stop
(stopping in any case if the $n$th stage is reached).
If $n$ is large, approximation of the sum in the above inequality by an integral readily
yields the approximation $r^* \approx n/e$. For further details, see DeGroot (1970, Chapter 13),
whose account is based closely on Lindley (1961a). For reviews of further, related work on
this fascinating problem, see Freeman (1983) and Ferguson (1989).
Applied to the problem of “choosing a mate”, and assuming that potential partners are
encountered uniformly over time between the ages of 16 and 60, the above analysis suggests 
delaying a choice until one is at least 32 years old, thereafter ending the search as soon as one 
encounters someone better than anyone encountered thus far. Readers who are suspicious 
of putting this into practice have the option, of course, of staying at home and continuing 
their study of this volume. 
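
As a numerical check on Example 2.4, the following Python sketch computes the stopping threshold for the “nothing but the best” utility in two ways: by the backward recursion for the optimal expected utility, and from the harmonic-sum inequality. The function names are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is an implementation choice made here so that ties between stopping and continuing are resolved exactly; neither comes from the text.

```python
from fractions import Fraction


def threshold_by_backward_induction(n):
    """Smallest stage r at which stopping is optimal when the current object
    is the best so far, under the 'nothing but the best' utility."""
    # Stage n: stopping is forced, utility 1 if relative rank x = 1, else 0.
    # u_next[x] is the optimal expected utility at the next stage (index 0 unused).
    u_next = [None] + [Fraction(1 if x == 1 else 0) for x in range(1, n + 1)]
    r_star = n
    for r in range(n - 1, 0, -1):
        # Expected utility of continuing: average over the r + 1 equally
        # likely relative ranks of the next object.
        cont = sum(u_next[1:r + 2]) / (r + 1)
        stop_if_best = Fraction(r, n)  # expected utility of stopping when x = 1
        if stop_if_best >= cont:
            r_star = r
        u_next = [None] + [
            max(stop_if_best if x == 1 else Fraction(0), cont)
            for x in range(1, r + 1)
        ]
    return r_star


def threshold_by_harmonic_sum(n):
    """Smallest r with 1/(n-1) + 1/(n-2) + ... + 1/r <= 1."""
    for r in range(1, n):
        if sum(Fraction(1, k) for k in range(r, n)) <= 1:
            return r
    return n


r_small = threshold_by_backward_induction(4)
r_large = threshold_by_backward_induction(100)
```

For $n = 100$ both routes give a threshold close to $100/e \approx 36.8$, in line with the large-$n$ approximation quoted in the example.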


Sequential decision problems are now further illustrated by considering the 
important special case of situations involving an initial choice of experimental 
design. 


2.6.3 Design of Experiments 


A simple, very important example of a sequential problem is provided by the 
situation where we have available a class of experiments, one of which is to be 
performed in order to provide information for use in a subsequent decision problem. 
We want to choose the “best” experiment. The structure of this problem, which 
embraces the topic usually referred to as the problem of experimental design, may 
be diagrammatically described by means of a sequential decision tree such as that 
shown in Figure 2.6. 


Figure 2.6 Decision tree for experimental design


We must first choose an experiment $e$ and, in light of the data $D$ obtained,
take an action $a$, which, were event $E$ to occur, would produce a consequence
having utility which, modifying earlier notation in order to be explicit about the
elements involved, we denote by $u(a, e, D, E)$. Usually, we also have available
the possibility, denoted by $e_0$ and referred to as the null experiment, of directly
choosing an action without performing any experiment.

Within the general structure for sequential decision problems developed in the 
previous section, we note that the possible sets of data obtainable may depend on 
the particular experiment performed, the set of available actions may depend on 
the results of the experiment performed, and the sets of consequences and labelling 
events may depend on the particular combination of experiment and action chosen. 
However, in our subsequent development we will use a simplified notation which 
suppresses these possible dependencies in order to centre attention on other, more 
important, aspects of the problem. 

We have seen, in Section 2.6.2, that to solve a sequential decision problem
we start at the last stage and work backwards. In this case, the expected utility of
option $a$, given the information available at the stage when the action is to be taken,
is

$$\bar{u}(a, e, D_i) = \sum_{j \in J} u(a, e, D_i, E_j)\, P(E_j \mid e, D_i, a).$$

For each pair $(e, D_i)$ we can therefore choose the best possible continuation;
namely, that action $a_i^*$ which maximises the expression given above. Thus, the
expected utility of the pair $(e, D_i)$ is given by

$$\bar{u}(e, D_i) = \bar{u}(a_i^*, e, D_i) = \max_a \bar{u}(a, e, D_i).$$

We are now in a position to determine the best possible experiment. This is
that $e$ which maximises, in the class of available experiments, the unconditional
expected utility

$$\bar{u}(e) = \sum_{i \in I} \bar{u}(a_i^*, e, D_i)\, P(D_i \mid e),$$

where $P(D_i \mid e)$ denotes the degree of belief attached to the occurrence of data
$D_i$ if $e$ were the experiment chosen. On the other hand, the expected utility of
performing no experiment and choosing that action $a_0^*$ which maximises the (prior)
expected utility is

$$\bar{u}(e_0) = \bar{u}(a_0^*, e_0) = \max_a \sum_{j \in J} u(a, e_0, E_j)\, P(E_j \mid e_0, a),$$

so that an experiment $e$ is worth performing if and only if $\bar{u}(e) > \bar{u}(e_0)$.
Naturally, $\bar{u}(a, e, D_i)$, $\bar{u}(e, D_i)$ and $\bar{u}(e)$ are different functions defined on
different spaces. However, to simplify the notation and without danger of confusion
we shall always use $\bar{u}$ to denote an expected utility.


Proposition 2.26. (Optimal experimental design). The optimal action is to
perform the experiment $e^*$ if $\bar{u}(e^*) > \bar{u}(e_0)$ and $\bar{u}(e^*) = \max_e \bar{u}(e)$;
otherwise, the optimal action is to perform no experiment.


Proof. This is immediate from Proposition 2.25. ◁
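
The comparison between an experiment and the null experiment can be illustrated with a small numerical sketch. Everything concrete here is an assumption made for illustration: the function names, the two-event, two-action toy problem, its probabilities, and the constant experimental cost of 0.05 subtracted from the terminal utility. The sketch also assumes that the probabilities of the events do not depend on the action taken.

```python
def u_bar_null(prior, u0, actions):
    """max over a of sum_j u(a, e_0, E_j) P(E_j | e_0)."""
    return max(sum(u0(a, j) * p_j for j, p_j in enumerate(prior)) for a in actions)


def u_bar_experiment(p_data, posterior, u, actions):
    """sum_i [max over a of sum_j u(a, e, D_i, E_j) P(E_j | e, D_i)] P(D_i | e)."""
    total = 0.0
    for i, p_i in enumerate(p_data):
        # Best continuation a_i* for the data D_i (last stage solved first).
        best = max(
            sum(u(a, i, j) * p_ij for j, p_ij in enumerate(posterior[i]))
            for a in actions
        )
        total += best * p_i
    return total


# Hypothetical toy problem: two events, two actions, two possible data sets.
terminal = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}   # utility of action a under E_j
actions = list(terminal)
prior = [0.7, 0.3]                                # P(E_j | e_0)
p_data = [0.62, 0.38]                             # P(D_i | e)
posterior = [[0.56 / 0.62, 0.06 / 0.62],          # P(E_j | e, D_1)
             [0.14 / 0.38, 0.24 / 0.38]]          # P(E_j | e, D_2)
cost = 0.05                                       # assumed cost of the experiment

u_e0 = u_bar_null(prior, lambda a, j: terminal[a][j], actions)
u_e = u_bar_experiment(p_data, posterior, lambda a, i, j: terminal[a][j] - cost, actions)
worth_performing = u_e > u_e0
```

With these numbers the experiment is worth performing; raising the assumed cost above 0.1 would reverse the conclusion.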
It is often interesting to determine the value which additional information
might have in the context of a given decision problem. 

The expected value of the information provided by new data may be computed 
as the (posterior) expected difference between the utilities which correspond to 
optimal actions after and before the data have been obtained. 


Definition 2.18. (The value of additional information).
(i) The expected value of the data $D_i$ provided by an experiment $e$ is

$$v(e, D_i) = \sum_{j \in J} \{u(a_i^*, e, D_i, E_j) - u(a_0^*, e_0, E_j)\}\, P(E_j \mid e, D_i, a_i^*),$$

where $a_i^*$, $a_0^*$ are, respectively, the optimal actions given $D_i$, and with no
data.
(ii) The expected value of an experiment $e$ is given by

$$v(e) = \sum_{i \in I} v(e, D_i)\, P(D_i \mid e).$$


It is sometimes convenient to have an upper bound for the expected value $v(e)$
of an experiment $e$. Let us therefore consider the optimal actions which would be
available with perfect information, i.e., were we to know the particular event $E_j$
which will eventually occur, and let $a_{(j)}^*$ be the optimal action given $E_j$, i.e., such
that, for all $E_j$,

$$u(a_{(j)}^*, e_0, E_j) = \max_a u(a, e_0, E_j).$$

Then, given $E_j$, the loss suffered by choosing any other action $a$ will be

$$u(a_{(j)}^*, e_0, E_j) - u(a, e_0, E_j).$$

For $a = a_0^*$, the optimal action under prior information, this difference will
measure, conditional on $E_j$, the value of perfect information and, under appropriate
conditions, its expected value will provide an upper bound for the increase in utility
which additional data about the $E_j$’s could be expected to provide.


Definition 2.19. (Expected value of perfect information). The opportunity
loss which would be suffered if action $a$ were taken and event $E_j$ occurred is

$$l(a, E_j) = \max_{a_i} u(a_i, e_0, E_j) - u(a, e_0, E_j);$$

the expected value of perfect information is then given by

$$v^*(e_0) = \sum_{j \in J} l(a_0^*, E_j)\, P(E_j \mid a_0^*).$$
It is important to bear in mind that the functions $v(e, D_i)$, $v(e)$ and the number
$v^*(e_0)$ all crucially depend on the (prior) probability distributions $\{P(E_j \mid a),
a \in \mathcal{A}\}$ although, for notational convenience, we have not made this dependency
explicit.

In many situations, the utility function $u(a, e, D_i, E_j)$ may be thought of as
made up of two separate components. One is the (experimental) cost of performing
$e$ and obtaining $D_i$; the other is the (terminal) utility of directly choosing $a$ and then
finding that $E_j$ occurs. Often, the latter component does not actually depend on the
preceding $e$ and $D_i$, so that, assuming additivity of the two components, we may
write $u(a, e, D_i, E_j) = u(a, e_0, E_j) - c(e, D_i)$ where $c(e, D_i) \geq 0$. Moreover,
the probability distributions over the events are often independent of the action
taken. When these conditions apply, we can establish a useful upper bound for the
expected value of an experiment in terms of the difference between the expected
value of complete data and the expected cost of the experiment itself.


Proposition 2.27. (Additive decomposition). If the utility function has the
form

$$u(a, e, D_i, E_j) = u(a, e_0, E_j) - c(e, D_i),$$

with $c(e, D_i) \geq 0$, and the probability distributions are such that

$$P(E_j \mid e, D_i, a) = P(E_j \mid e, D_i), \qquad P(E_j \mid e_0, a) = P(E_j \mid e_0),$$

then, for any available experiment $e$,

$$v(e) \leq v^*(e_0) - \bar{c}(e),$$

where

$$\bar{c}(e) = \sum_{i \in I} c(e, D_i)\, P(D_i \mid e)$$

is the expected cost of $e$.


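Proof. Using Definitions 2.18 and 2.19, $v(e)$ may be written as

$$\begin{aligned}
v(e) &= \sum_{i \in I} \Bigl[ \sum_{j \in J} \{u(a_i^*, e_0, E_j) - c(e, D_i) - u(a_0^*, e_0, E_j)\}\, P(E_j \mid e, D_i) \Bigr] P(D_i \mid e) \\
&= \sum_{i \in I} \Bigl[ \sum_{j \in J} \{u(a_i^*, e_0, E_j) - u(a_0^*, e_0, E_j)\}\, P(E_j \mid e, D_i) \Bigr] P(D_i \mid e) - \bar{c}(e) \\
&\leq \sum_{i \in I} \sum_{j \in J} \Bigl[ \max_a u(a, e_0, E_j) - u(a_0^*, e_0, E_j) \Bigr] P(E_j \cap D_i \mid e) - \bar{c}(e)
\end{aligned}$$

and, hence,

$$\begin{aligned}
v(e) &\leq \sum_{j \in J} l(a_0^*, E_j)\, P(E_j \mid e_0) \Bigl\{ \sum_{i \in I} P(D_i \mid E_j, e) \Bigr\} - \bar{c}(e) \\
&= \sum_{j \in J} l(a_0^*, E_j)\, P(E_j \mid e_0) - \bar{c}(e) \\
&= v^*(e_0) - \bar{c}(e),
\end{aligned}$$

as stated. ◁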
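
The bound of Proposition 2.27 can be checked numerically on a toy problem. The sketch below is an illustration only: the function name, the data layout and all numbers are hypothetical. It assumes the additive cost structure and action-independent probabilities required by the proposition, and obtains the posterior probabilities from the prior and the likelihood by Bayes' theorem.

```python
def value_of_information(prior, likelihood, terminal, cost):
    """Return (v(e), v*(e_0), expected cost) for one experiment e.

    prior[j]         = P(E_j | e_0)
    likelihood[i][j] = P(D_i | E_j, e)
    terminal[a][j]   = u(a, e_0, E_j)
    cost[i]          = c(e, D_i) >= 0
    """
    n_events = len(prior)

    def expected(a, probs):
        return sum(terminal[a][j] * probs[j] for j in range(n_events))

    a0 = max(terminal, key=lambda a: expected(a, prior))  # optimal action with no data
    v_e = 0.0
    c_bar = 0.0
    for i, row in enumerate(likelihood):
        joint = [row[j] * prior[j] for j in range(n_events)]  # P(E_j and D_i | e)
        p_i = sum(joint)                                      # P(D_i | e)
        if p_i == 0.0:
            continue
        post = [x / p_i for x in joint]                       # P(E_j | e, D_i)
        a_i = max(terminal, key=lambda a: expected(a, post))  # optimal action given D_i
        # v(e, D_i) as in Definition 2.18, with the additive cost removed.
        v_e += (expected(a_i, post) - cost[i] - expected(a0, post)) * p_i
        c_bar += cost[i] * p_i
    # Expected opportunity loss of a0, as in Definition 2.19.
    v_star = sum(
        (max(terminal[a][j] for a in terminal) - terminal[a0][j]) * prior[j]
        for j in range(n_events)
    )
    return v_e, v_star, c_bar


v_e, v_star, c_bar = value_of_information(
    prior=[0.7, 0.3],
    likelihood=[[0.8, 0.2], [0.2, 0.8]],
    terminal={"a1": [1.0, 0.0], "a2": [0.0, 1.0]},
    cost=[0.05, 0.05],
)
```

In this example the experiment is imperfect, so its expected value falls well short of the bound; a likelihood that identified the event with certainty would attain it.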


In Section 2.7, we shall study in more detail the special case of experimental 
design in situations where data are being collected for the purpose of pure inference, 
rather than as an input into a directly practical decision problem. 

We have shown that the simple decision problem structure introduced in Sec- 
tion 2.2, and the tools developed in Sections 2.3 to 2.5, suffice for the analysis of 
complex, sequential problems which, at first sight, appear to go beyond that simple 
structure. In particular, we have seen that the important problem of experimental 
design can be analysed within the sequential decision problem framework. We 
shall now use this framework to analyse the very special form of decision problem 
posed by statistical inference, thus establishing the fundamental relevance of these 
foundational arguments for statistical theory and practice. 


2.7. INFERENCE AND INFORMATION 
2.7.1. Reporting Beliefs as a Decision Problem 


The results on quantitative coherence (Sections 2.2 to 2.5) establish that if we aspire
to analyse a given decision problem, $\{\mathcal{E}, \mathcal{C}, \mathcal{A}, \leq\}$, in accordance with the axioms of
quantitative coherence, we must represent degrees of belief about uncertain events
in the form of a finite probability measure over $\mathcal{E}$ and values for consequences in
the form of a utility function over $\mathcal{C}$. Options are then to be compared on the basis
of expected utility.

The probability measure represents an individual’s beliefs conditional on his or
her current state of information. Given the initial state of information described by
$M_0$ and further information in the form of the assumed occurrence of a significant
event $G$, we previously denoted such a measure by $P(. \mid G)$. We now wish to
specialise our discussion somewhat to the case where $G$ can be thought of as a
description of the outcome of an investigation (typically a survey, or an experiment)
involving the deliberate collection of data (usually, in numerical form). The event
$G$ will then be defined directly in terms of the counts or measurements obtained,
either as a precise statement, or involving a description of intervals within which
readings lie. To emphasise the fact that $G$ characterises the actual data collected,
we shall denote the event which describes the new information obtained by $D$.
An individual's degree of belief measure over $\mathcal{E}$ will then be denoted $P(. \mid D)$,
representing the individual's current beliefs in the light of the data obtained (where,
again, we have suppressed, for notational convenience, the explicit dependence on
$M_0$). So far as uncertainty about the events of $\mathcal{E}$ is concerned, $P(. \mid D)$ constitutes
a complete encapsulation of the information provided by $D$, given the initial state
of information $M_0$. Moreover, in conjunction with the specification of a utility
function, $P(. \mid D)$ provides all that is necessary for the calculation of the expected
utility of any option and, hence, for the solution of any decision problem defined
in terms of the frame of reference adopted.

Starting from the decision problem framework, we thus have a formal justification
for the main topic of this book; namely, the study of models and techniques
for analysing the ways in which beliefs are modified by data. However, many em- 
inent writers have argued that basic problems of reporting scientific inferences do 
not fall within the framework of decision problems as defined in earlier sections: 


Statistical inferences involve the data, a specification of the set of possible
populations sampled and a question concerning the true population. ... Decisions
are based on not only the considerations listed for inferences, but also on an
assessment of the losses resulting from wrong decisions. ... (Cox, 1958);


... a considerable body of doctrine has attempted to explain, or rather to reinterpret
these (significance) tests on the basis of quite a different model, namely as
means to making decisions in an acceptance procedure. The differences between
these two situations seem to the author many and wide. ... (Fisher, 1956/1973).


If views such as these were accepted, they would, of course, undermine our
conclusion that problems concerning uncertainty are to be solved by revising de- 
grees of belief in the light of new data in accordance with Bayes’ theorem. Our 
main purpose in this section is therefore to demonstrate that the problem of reporting 
inferences is essentially a special case of a decision problem. 

By way of preliminary clarification, let us recall from Section 2.1 that we 
distinguished two, possibly distinct, reasons for trying to think rationally about 
uncertainty. On the one hand, quoting Ramsey (1926), we noted that, even if an 
immediate decision problem does not appear to exist, we know that our statements 
of uncertainty may be used by others in contexts representable within the decision 
framework. In such situations, our conclusion holds. On the other hand, quoting
Lehmann (1959/1986), we noted that the inference, or inference statement, may
sometimes be regarded as an end in itself, to be judged independently of any
“practical” decision problem. It is this case that we wish to consider in more detail 
in this section, establishing that, indeed, it can be regarded as falling within the 
general framework of Sections 2.2 to 2.5. 
Formalising the first sentence of the remark of Cox, given above, a pure
inference problem may be described as one in which we seek to learn which of a
set of mutually exclusive “hypotheses” (“theories”, “states of nature”, or “model
parameters”) is true. From a strictly realistic viewpoint, there is always, implicitly,
a finite set of such hypotheses, say $\{H_j, j \in J\}$, although it may be mathematically
convenient to work as if this were not the case. We shall regard this set of hypotheses
as equivalent to a finite partition of the certain event into events $\{E_j, j \in J\}$,
having the interpretation $E_j$ = “the hypothesis $H_j$ is true”. The actions available
to an individual are the various inference statements that might be made about
the events $\{E_j, j \in J\}$, the latter constituting the uncertain events corresponding
to each action. To complete the basic decision problem framework, we need to
acknowledge that, corresponding to each inference statement and each $E_j$, there
will be a consequence; namely, the record of what the individual put forward as an
appropriate inference statement, together with what actually turned out to be the
case.

If we aspire to quantitative coherence in such a framework, we know that
our uncertainty about the $\{E_j, j \in J\}$ should be represented by
$\{P(E_j \mid D), j \in J\}$, where $P(. \mid D)$ denotes our current degree of belief measure, given data
$D$ in addition to the initial information $M_0$. It is natural, therefore, to regard the
set of possible inference statements as the class of probability distributions over
$\{E_j, j \in J\}$ compatible with the information $D$. The inference reporting problem
can thus be viewed as one of choosing a probability distribution to serve as an 
inference statement. But there is nothing (so far) in this formulation which leads to 
the conclusion that the best action is to state one’s actual beliefs. Indeed, we know 
from our earlier development that options cannot be ordered without an (implicit 
or explicit) specification of utilities for the consequences. We shall consider this 
specification and its implications in the following sections. A particular form 
of utility function for inference statements will be introduced and it will then be 
seen that the idea of inference as decision leads to rather natural interpretations of 
commonly used information measures in terms of expected utility. In the discussion 
which follows, we shall only consider the case of finite partitions $\{E_j, j \in J\}$.
Mathematical extensions will be discussed in Chapter 3. 


2.7.2 The Utility of a Probability Distribution 


We have argued above that the provision of a statistical inference statement about
a class of exclusive and exhaustive “hypotheses” $\{E_j, j \in J\}$, conditional on
some relevant data $D$, may be precisely stated as a decision problem, where the
set of “hypotheses” $\{E_j, j \in J\}$ is a partition consisting of elements of $\mathcal{E}$, and the
action space $\mathcal{A}$ relates to the class $Q$ of conditional probability distributions over
$\{E_j, j \in J\}$; thus,

$$Q = \Bigl\{ q = (q_j, j \in J);\; q_j \geq 0,\; \sum_{j \in J} q_j = 1 \Bigr\},$$

where $q_j$ is assumed to be the probability which, conditional on the available data
$D$, an individual reports as the probability of $E_j$ (that is, of $H_j$ being true). The set of
consequences $\mathcal{C}$ consists of all pairs $(q, E_j)$, representing the conjunctions of
reported beliefs and true hypotheses. The action corresponding to the choice of $q$
is defined as $\{(q, E_j) \mid E_j, j \in J\}$.

To avoid triviality, we assume that none of the hypotheses is certain and that,
without loss of generality, all are compatible with the available data; i.e., that all
the $E_j$'s are significant given $D$, so that (Proposition 2.5) $\emptyset < E_j \cap D < D$ for all
$j \in J$. If this were not so, we could simply discard any incompatible hypotheses.
It then follows from Proposition 2.17(iii) that each of the personal degrees of belief
attached by the individual to the conflicting hypotheses given the data must be
strictly positive. Throughout this section, we shall denote by

$$p = (p_j = P(E_j \mid D),\; j \in J), \qquad p_j > 0, \quad \sum_{j \in J} p_j = 1,$$


the probability distribution which describes, conditional again on the available data 
D, the individual’s actual beliefs about the alternative “hypotheses”. 


We emphasise again that, in the structure described so far, there is no logical 
requirement which forces an individual to report the probability distribution p 
which describes his or her personal beliefs, in preference to any other probability 
distribution $q$ in $Q$.


We complete the specification of this decision problem by inducing the preference
ordering through direct specification of a utility function $u(.)$, which describes
the “value” $u(q, E_j)$ of reporting the probability distribution $q$ as the final inferential
summary of the investigation, were $E_j$ to turn out to be the true “state of nature”.
Our next task is to investigate the properties which such a function should possess 
in order to describe a preference pattern which accords with what a scientific com- 
munity ought to demand of an inference statement. This special class of utility 
functions is often referred to as the class of score functions (see also Section 2.8) 
since the functions describe the possible “scores” to be awarded to the individual 
as a “prize” for his or her “prediction”. 


Definition 2.20. (Score function). A score function $u$ for probability
distributions $q = \{q_j, j \in J\}$ defined over a partition $\{E_j, j \in J\}$ is a mapping
which assigns a real number $u\{q, E_j\}$ to each pair $(q, E_j)$. This function is
said to be smooth if it is continuously differentiable as a function of each $q_j$.


It seems natural to assume that score functions should be smooth (in the intuitive
sense), since one would wish small changes in the reported distribution to produce 
only small changes in the obtained score. The mathematical condition imposed 
is a simple and convenient representation of such smoothness. 
We have characterised the problem faced by an individual reporting his or her
beliefs about conflicting “hypotheses” as a problem of choice among probability
distributions over $\{E_j, j \in J\}$, with preferences described by a score function.
This is a well specified problem, whose solution, in accordance with our
development based on quantitative coherence, is to report that distribution $q$ which
maximises the expected utility

$$\sum_{j \in J} u(q, E_j)\, P(E_j \mid D).$$

In order to ensure that a coherent individual is also honest, we need a form of
$u(.)$ which guarantees that the expected utility is maximised if, and only if,
$q_j = p_j = P(E_j \mid D)$, for each $j$; otherwise, the individual's best policy could be to
report something other than his or her true beliefs. This motivates the following
definition:


Definition 2.21. (Proper score function). A score function $u$ is proper if, for
each strictly positive probability distribution $p = \{p_j, j \in J\}$ defined over a
partition $\{E_j, j \in J\}$,

$$\sup_{q} \sum_{j \in J} u(q, E_j)\, p_j = \sum_{j \in J} u(p, E_j)\, p_j,$$

where the supremum, taken over the class $Q$ of all probability distributions
over $\{E_j, j \in J\}$, is attained if, and only if, $q = p$.


It would seem reasonable that, in a scientific inference context, one should require
a score function to be proper. Whether a scientific report presents the inference 
of a single scientist or a range of inferences, purporting to represent those that 
might be made by some community of scientists, we should wish to be reassured 
that any reported inference could be justified as a genuine current belief. 


Smooth, proper score functions have been successfully used in practice in the 
following contexts: (i) to determine an appropriate fee to be paid to meteorologists 
in order to encourage them to report reliable predictions (Murphy and Epstein, 
1967); (ii) to score multiple choice examinations so that students are encouraged 
to assign, over the possible answers, probability distributions which truly describe 
their beliefs (de Finetti, 1965; Bernardo, 1981b, Section 3.6); (iii) to devise general 
procedures to elicit personal probabilities and expectations (Savage, 1971); (iv) 
to select best subsets of variables for prediction purposes in political or medical 
contexts (Bernardo and Bermúdez, 1985).

The simplest proper score function is the quadratic function (Brier, 1950; de 
Finetti, 1962) defined as follows. 
Definition 2.22. (Quadratic score function). A quadratic score function for
probability distributions $q = \{q_j, j \in J\}$ defined over a partition $\{E_j, j \in J\}$
is any function of the form

$$u\{q, E_j\} = A \Bigl\{ 2 q_j - \sum_{i \in J} q_i^2 \Bigr\} + B_j, \qquad A > 0,$$

where $q = \{q_j, j \in J\}$ is any probability distribution over $\{E_j, j \in J\}$.


Using the indicator function for $E_j$, $1_{E_j}$, an alternative expression for the
quadratic score function is given by

$$u\{q, E_j\} = A \Bigl\{ 1 - \sum_{i \in J} (q_i - 1_{E_i})^2 \Bigr\} + B_j, \qquad A > 0,$$

which makes explicit the role of a ‘penalty’ equal to the squared Euclidean distance
from $q$ to a perfect prediction.


Proposition 2.28. A quadratic score function is proper.


Proof. We have to maximise, over $q$, the expected score

$$\sum_{j \in J} u\{q, E_j\}\, p_j = \sum_{j \in J} \Bigl\{ A \Bigl( 2 q_j - \sum_{i \in J} q_i^2 \Bigr) + B_j \Bigr\} p_j.$$

Taking derivatives with respect to the $q_j$'s and equating them to zero, we have the
system of equations $2 p_j - 2 q_j \{\sum_i p_i\} = 0$, $j \in J$, and since $\sum_i p_i = 1$, we have
$q_j = p_j$ for all $j$. It is easily checked that this gives a maximum. ◁
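
The properness of the quadratic score can also be checked numerically. In the sketch below the function names and the example distributions are hypothetical. It relies on the elementary identity that, for the quadratic score, the expected score lost by reporting $q$ instead of $p$ equals $A \sum_i (p_i - q_i)^2$, which follows by expanding the expected score given in the proof.

```python
def quadratic_score(q, j, A=1.0, B=None):
    """u{q, E_j} = A * (2 q_j - sum_i q_i^2) + B_j, with B_j = 0 by default."""
    b_j = 0.0 if B is None else B[j]
    return A * (2.0 * q[j] - sum(x * x for x in q)) + b_j


def expected_score(q, p, score, **kwargs):
    """Expected score of reporting q when actual beliefs are p."""
    return sum(p_j * score(q, j, **kwargs) for j, p_j in enumerate(p))


p = [0.5, 0.3, 0.2]                                  # actual beliefs
reports = [[0.6, 0.3, 0.1], [1 / 3, 1 / 3, 1 / 3], [0.9, 0.05, 0.05]]

# Expected score lost by reporting q rather than p, for each alternative report.
gaps = [
    expected_score(p, p, quadratic_score) - expected_score(q, p, quadratic_score)
    for q in reports
]
# Squared Euclidean distance between p and each report (A = 1 here).
sq_dists = [sum((a - b) ** 2 for a, b in zip(p, q)) for q in reports]
```

Every gap is strictly positive, so each alternative report does worse in expectation than the honest one, and the constants $B_j$ play no role in the comparison.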


Note that in the proof of Proposition 2.28 we did not need to use the condition
$\sum_j q_j = 1$; this is a rather special feature of the quadratic score function.


A further condition is required for score functions in contexts, which we shall
refer to as “pure inference problems”, where the value of a distribution, $q$, is only
to be assessed in terms of the probability it assigned to the actual outcome.


Definition 2.23. (Local score function). A score function $u$ is local if, for
each element $q = \{q_j, j \in J\}$ of the class $Q$ of probability distributions
defined over a partition $\{E_j, j \in J\}$, there exist functions $\{u_j(.), j \in J\}$
such that $u\{q, E_j\} = u_j(q_j)$.


It is intuitively clear that the preferences of an individual scientist faced with a
pure inference problem should correspond to the ordering induced by a local score
function. The reason for this is that, by definition, in a “pure” inference problem
we are solely concerned with “the truth”. It is therefore natural that if $E_j$, say, turns
out to be true, the individual scientist should be assessed (i.e., scored) only on the
basis of his or her reported judgement about the plausibility of $E_j$.
This can be contrasted with the forms of “score” function that would typically
be appropriate in more directly practical contexts. In stock control, for example,
probability judgements about demand would usually be assessed in the light of the
relative seriousness of under- or over-stocking, rather than by just concentrating
on the belief previously attached to what turned out to be the actual level of 
demand. 


Note that, in Definition 2.23, the functional form $u_j(p_j)$ of the dependence
of the score on the probability attached to the true $E_j$ is allowed to vary with the
particular $E_j$ considered. By permitting different $u_j(.)$’s for each $E_j$, we allow for
the possibility that “bad predictions” regarding some “truths” may be judged more 
harshly than others. 

The situation described by a local score function is, of course, an idealised, 
limit situation, but one which seems, at least approximately, appropriate in reporting 
pure scientific research. In addition, later in this section we shall see that certain 
well-known criteria for choosing among experimental designs are optimal if, and 
only if, preferences are described by a smooth, proper, local score function. 


Proposition 2.29. (Characterisation of proper local score functions). If $u$
is a smooth, proper, local score function for probability distributions
$q = \{q_j, j \in J\}$ defined over a partition $\{E_j, j \in J\}$ which contains more than
two elements, then it must be of the form $u\{q, E_j\} = A \log q_j + B_j$, where
$A > 0$ and the $B_j$'s are arbitrary constants.


Proof. Since $u(.)$ is local and proper, then for some $\{u_j(.), j \in J\}$, we must
have

$$\sup_q \sum_{j \in J} u(q, E_j)\, p_j = \sup_q \sum_{j \in J} u_j(q_j)\, p_j = \sum_{j \in J} u_j(p_j)\, p_j,$$

where $p_j > 0$, $\sum_j p_j = 1$ and the supremum is taken over the class of probability
distributions $q = (q_j, j \in J)$, $q_j \geq 0$, $\sum_j q_j = 1$.
Writing $p = \{p_1, p_2, \ldots\}$ and $q = \{q_1, q_2, \ldots\}$, with

$$p_1 = 1 - \sum_{j > 1} p_j, \qquad q_1 = 1 - \sum_{j > 1} q_j,$$

we seek $\{u_j(.), j \in J\}$, giving an extremal of

$$F\{q_2, q_3, \ldots\} = \Bigl( 1 - \sum_{j > 1} p_j \Bigr) u_1\Bigl( 1 - \sum_{j > 1} q_j \Bigr) + \sum_{j > 1} p_j\, u_j(q_j).$$

For $\{q_2, q_3, \ldots\}$ to make $F$ stationary it is necessary (see e.g. Jeffreys and Jeffreys,
1946, p. 315) that

$$\frac{\partial}{\partial \alpha} F\{q_2 + \alpha \varepsilon_2, q_3 + \alpha \varepsilon_3, \ldots\} \Big|_{\alpha = 0} = 0$$

for any $\varepsilon = \{\varepsilon_2, \varepsilon_3, \ldots\}$ such that all the $\varepsilon_j$ are sufficiently small. Calculating this
derivative, the condition is seen to reduce to

$$\sum_{j > 1} \Bigl\{ \Bigl( 1 - \sum_{i > 1} p_i \Bigr) u_1'\Bigl( 1 - \sum_{i > 1} q_i \Bigr) - p_j\, u_j'(q_j) \Bigr\} \varepsilon_j = 0$$

for all $\varepsilon_j$'s sufficiently small, where $u'$ stands for the derivative of $u$. Moreover,
since $u$ is proper, $\{p_2, p_3, \ldots\}$ must be an extremal of $F$ and thus we have the
system of equations

$$p_j\, u_j'(p_j) = \Bigl( 1 - \sum_{i > 1} p_i \Bigr) u_1'\Bigl( 1 - \sum_{i > 1} p_i \Bigr), \qquad j = 2, 3, \ldots,$$

so that all the functions $u_j$, $j = 1, 2, \ldots$ satisfy the same functional equation,
namely

$$p_j\, u_j'(p_j) = p_1\, u_1'(p_1), \qquad j = 2, 3, \ldots,$$

for all $\{p_2, p_3, \ldots\}$ and, hence,

$$p\, u_j'(p) = A, \qquad 0 < p \leq 1, \quad \text{for all } j = 1, 2, \ldots,$$

so that $u_j(p) = A \log p + B_j$. The condition $A > 0$ suffices to guarantee that the
extremal found is indeed a maximum. ◁


Definition 2.24. (Logarithmic score function). A logarithmic score function
for strictly positive probability distributions $q = \{q_j, j \in J\}$ defined over a
partition $\{E_j, j \in J\}$ is any function of the form

\[
u\{q, E_j\} = A \log q_j + B_j, \qquad A > 0.
\]
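As an illustrative numerical check, not part of the original development, the following Python sketch evaluates the expected logarithmic score of a few candidate reports $q$ under fixed beliefs $p$. The function name `expected_log_score`, the beliefs and the candidate reports are all assumptions chosen for illustration; the point is only that the honest report $q = p$ attains the largest expected score among the candidates, as propriety requires.

```python
import math

def expected_log_score(q, p, A=1.0, B=None):
    """Expected utility sum_j p_j * (A*log(q_j) + B_j) of reporting q under beliefs p."""
    if B is None:
        B = [0.0] * len(p)
    return sum(pj * (A * math.log(qj) + bj) for pj, qj, bj in zip(p, q, B))

p = [0.5, 0.3, 0.2]            # actual beliefs (illustrative values)
candidates = [
    [0.5, 0.3, 0.2],           # the honest report q = p
    [0.6, 0.2, 0.2],
    [0.4, 0.4, 0.2],
    [1 / 3, 1 / 3, 1 / 3],
]
scores = [expected_log_score(q, p) for q in candidates]
best = candidates[scores.index(max(scores))]
print(best)
```

The constants $B_j$ shift every candidate's expected score by the same amount, so they do not affect which report is best.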


If the partition $\{E_j, j \in J\}$ only contains two elements, so that the partition
is simply $\{H, H^c\}$, the locality condition is, of course, vacuous. In this case,
$u\{q, E_j\} = u\{(q_1, 1 - q_1), 1_H\} = f(q_1, 1_H)$, say, where $1_H$ is the indicator
function for $H$, and the score function only depends on the probability $q_1$ attached
to $H$, whether or not $H$ occurs.

For $u\{(q_1, 1 - q_1), 1_H\}$ to be proper we must have

\[
\sup_{q_1 \in [0,1]} \{p_1 f(q_1, 1) + (1 - p_1) f(q_1, 0)\} = p_1 f(p_1, 1) + (1 - p_1) f(p_1, 0)
\]

so that, if the score function is smooth, then $f$ must satisfy the functional equation

\[
x f'(x, 1) + (1 - x) f'(x, 0) = 0.
\]

The logarithmic function $f(x, 1) = A \log x + B_1$, $f(x, 0) = A \log(1 - x) + B_2$ is
then just one of the many possible solutions (see Good, 1952).
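To make the remark about “many possible solutions” concrete, the sketch below checks the functional equation numerically for two candidate scores. The logarithmic pair is the one given in the text; the quadratic pair $f(x,1) = -(1-x)^2$, $f(x,0) = -x^2$ is an assumption introduced here as a second example and does not appear in the source. Derivatives are approximated by central differences, so the residuals are only expected to vanish up to numerical error.

```python
import math

def derivative(g, x, h=1e-6):
    """Central-difference approximation to g'(x)."""
    return (g(x + h) - g(x - h)) / (2 * h)

def residual(f1, f0, x):
    """x f'(x,1) + (1-x) f'(x,0), which should vanish for a smooth proper score."""
    return x * derivative(f1, x) + (1 - x) * derivative(f0, x)

# Logarithmic score (A = 1, constants omitted since they do not affect derivatives).
log_f1 = lambda x: math.log(x)
log_f0 = lambda x: math.log(1 - x)

# Quadratic score: an illustrative alternative, not taken from the text.
quad_f1 = lambda x: -(1 - x) ** 2
quad_f0 = lambda x: -x ** 2

grid = [0.1, 0.25, 0.5, 0.75, 0.9]
log_res = [residual(log_f1, log_f0, x) for x in grid]
quad_res = [residual(quad_f1, quad_f0, x) for x in grid]
print(max(abs(r) for r in log_res + quad_res))
```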

We have assumed that the probability distributions to be considered as options
assign strictly positive $q_j$ to each $E_j$. This means that, given any particular $q \in Q$,
we have no problem in calculating the expected utility arising from the logarithmic
score function. It is worth noting, however, that since we place no (strictly positive)
lower bound on the possible $q_j$, we have an example of an unbounded decision
problem; i.e., a decision problem without extreme consequences.


2.7.3 Approximation and Discrepancy 


We have argued that the optimal solution to an inference reporting problem (either 
for an individual, or for each of several individuals) is to state the appropriate actual 
beliefs, p, say. From a technical point of view, however, particularly within the 
mathematical extensions to be considered in Chapter 3, the precise computation of 
p may be difficult and we may choose instead to report an approximation to our 
beliefs, q, say, on the grounds that q is “close” to p, but much easier to calculate. 
The justification of such a procedure requires a study of the notion of “closeness” 
between two distributions. 


Proposition 2.30. (Expected loss in probability reporting). If preferences
are described by a logarithmic score function, the expected loss of utility in
reporting a probability distribution $q = \{q_j, j \in J\}$ defined over a partition
$\{E_j, j \in J\}$, rather than the distribution $p = \{p_j, j \in J\}$ representing
actual beliefs, is given by

\[
\delta\{q \mid p\} = A \sum_{j \in J} p_j \log (p_j / q_j), \qquad A > 0.
\]

Moreover, $\delta\{q \mid p\} \geq 0$ with equality if and only if $q = p$.


Proof. Using Definition 2.24, the expected utility of reporting $q$ when $p$ is the
actual distribution of beliefs is $\bar{u}(q) = \sum_j \{A \log q_j + B_j\}\, p_j$, and thus

\[
\delta\{q \mid p\} = \bar{u}(p) - \bar{u}(q)
= \sum_{j \in J} \{(A \log p_j + B_j) - (A \log q_j + B_j)\}\, p_j
= A \sum_{j \in J} p_j \log \frac{p_j}{q_j}.
\]

The final statement in the proposition is a consequence of Proposition 2.29 since,
because the logarithmic score function is proper, the expected utility of reporting $q$
is maximised if, and only if, $q = p$, so that $\bar{u}(p) \geq \bar{u}(q)$, with equality if, and only
if, $p = q$. An immediate direct proof is obtained using the fact that for all $x > 0$,
$\log x \leq x - 1$ with equality if, and only if, $x = 1$. Indeed, since $A > 0$, we then have

\[
-\delta\{q \mid p\} = A \sum_{j \in J} p_j \log \frac{q_j}{p_j}
\leq A \sum_{j \in J} p_j \{(q_j / p_j) - 1\}
= A \Big( \sum_{j \in J} q_j - \sum_{j \in J} p_j \Big) = A(1 - 1) = 0,
\]

with equality if, and only if, $q_j = p_j$ for all $j$.
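The expected loss of Proposition 2.30 is simple to compute directly. The following sketch is illustrative only: the function name `discrepancy` and the two distributions are assumptions, and natural logarithms are used. It evaluates the expected loss in both directions to show that the measure is directed, and confirms that it vanishes when $q = p$.

```python
import math

def discrepancy(q, p, A=1.0):
    """delta{q | p} = A * sum_j p_j * log(p_j / q_j); p are actual beliefs, q the report."""
    return A * sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))

p = [0.5, 0.3, 0.2]   # actual beliefs (illustrative)
q = [0.4, 0.4, 0.2]   # reported approximation (illustrative)

d_qp = discrepancy(q, p)   # loss from reporting q when beliefs are p
d_pq = discrepancy(p, q)   # loss from reporting p when beliefs are q
d_pp = discrepancy(p, p)   # loss from reporting actual beliefs
print(d_qp, d_pq, d_pp)
```

Swapping the roles of the two distributions changes the value, which is why the argument order matters whenever the function is used.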


The quantity 6{q |p}, which arises here as a difference between two expected 
utilities, was introduced by Kullback and Leibler (1951) as an ad hoc measure of 
(directed) divergence between two probability distributions. 


Combining Propositions 2.29 and 2.30, it is clear that an individual with 
preferences approximately described by a proper local score function should beware 
of approximating by zero. This reflects the fact that the “tails” of the distribution 
are, generally speaking, extremely important in pure inference problems. This 
is in contrast to many practical decision problems where the form of the utility 
function often makes the solution robust with respect to changes in the “tails” of 
the distribution assumed. 
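The warning against approximating by zero can be seen numerically. In the sketch below, which uses made-up numbers and a hypothetical helper `discrepancy`, the actual beliefs give probability 0.01 to a “tail” event, while the reported approximation shrinks that probability towards zero. The expected loss grows without bound as it does so.

```python
import math

def discrepancy(q, p):
    """sum_j p_j * log(p_j / q_j), in natural-log units."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))

p = [0.90, 0.09, 0.01]               # actual beliefs, with a small tail probability
values = []
for eps in (1e-2, 1e-4, 1e-8, 1e-16):
    q = [0.90, 0.10 - eps, eps]      # approximation whose tail mass shrinks to zero
    values.append(discrepancy(q, p))
print(values)
```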

Proposition 2.30 suggests a natural, general measure of “lack of fit”, or
discrepancy, between a distribution and an approximation, when preferences are
described by a logarithmic score function.


Definition 2.25. (Discrepancy of an approximation). The discrepancy
between a strictly positive probability distribution $p = \{p_j, j \in J\}$ over a
partition $\{E_j, j \in J\}$ and an approximation $\hat{p} = \{\hat{p}_j, j \in J\}$ is defined by

\[
\delta\{\hat{p} \mid p\} = \sum_{j \in J} p_j \log \frac{p_j}{\hat{p}_j}.
\]

Example 2.5. (Poisson approximation to a binomial distribution). The behaviour
of $\delta\{\hat{p} \mid p\}$ is well illustrated by a familiar, elementary example. Consider the binomial
distribution

\[
p_j = \binom{n}{j}\, \theta^{j} (1 - \theta)^{n - j}, \qquad j = 0, 1, \ldots, n,
\]

with $p_j = 0$ otherwise, and let

\[
\hat{p}_j = \exp\{-n\theta\}\, \frac{(n\theta)^{j}}{j!}, \qquad j = 0, 1, \ldots
\]

be its Poisson approximation. It is apparent from Figure 2.7 that $\delta\{\hat{p} \mid p\}$ decreases as either
$n$ increases or $\theta$ decreases, or both, and that the second factor is far more important than the
first. However, it follows from our previous discussion that it would not be a good idea to
reverse the roles and try to approximate a Poisson distribution by a binomial distribution.

[Figure 2.7 Discrepancy between a binomial distribution and its Poisson approximation (logarithms to base 2). Vertical axis: $\delta\{\text{Poisson} \mid \text{Binomial}\}$; horizontal axis: $\theta$; one curve for each value of $n$.]
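The qualitative behaviour described in Example 2.5 can be reproduced with a short computation. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the values of $n$ and $\theta$ are arbitrary, and the sum runs over $j = 0, \ldots, n$, where the binomial probabilities are positive. Logarithms are to base 2, so the results are in bits.

```python
import math

def binomial_pmf(n, theta):
    """Binomial probabilities p_j for j = 0..n."""
    return [math.comb(n, j) * theta**j * (1 - theta)**(n - j) for j in range(n + 1)]

def poisson_pmf_value(j, lam):
    """Poisson probability exp(-lam) * lam^j / j!."""
    return math.exp(-lam) * lam**j / math.factorial(j)

def poisson_binomial_discrepancy(n, theta):
    """delta{Poisson | Binomial} in bits: sum_j p_j * log2(p_j / phat_j)."""
    lam = n * theta
    total = 0.0
    for j, pj in enumerate(binomial_pmf(n, theta)):
        if pj > 0.0:
            total += pj * math.log2(pj / poisson_pmf_value(j, lam))
    return total

d_small_n = poisson_binomial_discrepancy(1, 0.10)
d_large_n = poisson_binomial_discrepancy(10, 0.10)
d_large_theta = poisson_binomial_discrepancy(10, 0.25)
print(d_small_n, d_large_n, d_large_theta)
```

Reversing the roles would require summing over the whole Poisson support, where the binomial assigns zero probability to every $j > n$, so the discrepancy would be infinite.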


When, as in Figure 2.7, logarithms to base 2 are used, the utility and discrepancy 
are measured on the well-known scale of bits of information (or entropy), which 
can be interpreted in terms of the expected number of yes-no questions required 
to identify the true event in the partition (see, for example, de Finetti, 1970/1974, 
p. 103, or Renyi, 1962/1970, p. 564). 


Clearly, Definition 2.25 provides a systematic approach to approximation in 
pure inference contexts. The best approximation within a given family will be that 
which minimises the discrepancy. 
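As a small illustration of choosing the best approximation within a family, the sketch below searches a grid of Poisson means $\lambda$ for the one minimising the discrepancy from a fixed binomial distribution. The grid, the helper names and the values $n = 10$, $\theta = 0.2$ are assumptions made for the example. Setting the derivative of the discrepancy with respect to $\lambda$ to zero gives $\lambda$ equal to the binomial mean $n\theta$, so the grid minimiser would be expected to sit at $\lambda = 2$.

```python
import math

def binomial_pmf(n, theta):
    """Binomial probabilities p_j for j = 0..n."""
    return [math.comb(n, j) * theta**j * (1 - theta)**(n - j) for j in range(n + 1)]

def discrepancy_from_poisson(p, lam):
    """sum_j p_j * log(p_j / phat_j), with phat the Poisson(lam) probabilities."""
    total = 0.0
    for j, pj in enumerate(p):
        if pj > 0.0:
            phat = math.exp(-lam) * lam**j / math.factorial(j)
            total += pj * math.log(pj / phat)
    return total

p = binomial_pmf(10, 0.2)
grid = [k / 100 for k in range(50, 401)]     # candidate means 0.50, 0.51, ..., 4.00
best_lam = min(grid, key=lambda lam: discrepancy_from_poisson(p, lam))
print(best_lam)
```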


2.7.4 Information 


In Section 2.4.2, we showed that, for quantitative coherence, any new information D 
should be incorporated into the analysis by updating beliefs via Bayes’ theorem, so 
that the initial representation of beliefs P(.) is updated to the conditional probability 
measure P(.| D). In Section 2.7.2, we showed that, within the context of the pure 
inference reporting problem, utility is defined in terms of the logarithmic score 
function. 




Proposition 2.31. (Expected utility of data). If preferences are described by
a logarithmic score function for the class of probability distributions defined
over a partition $\{E_j, j \in J\}$, then the expected increase in utility provided
by data $D$, when the initial probability distribution $\{P(E_j), j \in J\}$ is strictly
positive, is given by

\[
A \sum_{j \in J} P(E_j \mid D) \log \frac{P(E_j \mid D)}{P(E_j)},
\]

where $A > 0$ is arbitrary, and $\{P(E_j \mid D), j \in J\}$ is the conditional
probability distribution, given $D$. Moreover, this expected increase in utility is
non-negative and is zero if, and only if, $P(E_j \mid D) = P(E_j)$ for all $j$.


Proof. By Definition 2.24, the utilities of reporting $P(\cdot)$ or $P(\cdot \mid D)$, were
$E_j$ known to be true, would be $A \log P(E_j) + B_j$ and $A \log P(E_j \mid D) + B_j$,
respectively. Thus, conditional on $D$, the expected increase in utility provided by
$D$ is given by

\[
\sum_{j \in J} \{(A \log P(E_j \mid D) + B_j) - (A \log P(E_j) + B_j)\}\, P(E_j \mid D)
= A \sum_{j \in J} P(E_j \mid D) \log \frac{P(E_j \mid D)}{P(E_j)},
\]

which, by Proposition 2.30, is non-negative and is zero if, and only if, for all $j$,
$P(E_j \mid D) = P(E_j)$.


In the context of pure inference problems, we shall find it convenient to un- 
derline the fact that, because of the use of the logarithmic score function, utility 
assumes a special form and establishes a link between utility theory and classical 
information theory. This motivates Definitions 2.26 and 2.27. 


Definition 2.26. (Information from data). The amount of information about
a partition $\{E_j, j \in J\}$ provided by the data $D$, when the initial distribution
over $\{E_j, j \in J\}$ is $p_0 = \{P(E_j), j \in J\}$, is defined to be

\[
I(D \mid p_0) = \sum_{j \in J} P(E_j \mid D) \log \frac{P(E_j \mid D)}{P(E_j)},
\]

where $\{P(E_j \mid D), j \in J\}$ is the conditional probability distribution given
the data $D$.

It follows from Definition 2.26 that the amount of information provided by
data $D$ is equal to $\delta\{p_0 \mid p_D\}$, the discrepancy measure if $p_0 = \{P(E_j), j \in J\}$ is
considered as an approximation to $p_D = \{P(E_j \mid D), j \in J\}$. Another interesting
interpretation of $I(D \mid p_0)$ arises from the following analysis. Conditional on $E_j$,
$\log P(E_j)$ and $\log P(E_j \mid D)$ measure, respectively, how good the initial and the
conditional distributions are in “predicting” the “true hypothesis” $E_j = H_j$, so
that $\log P(E_j \mid D) - \log P(E_j)$ is a measure of the value of $D$, were $E_j$ known to
be true; $I(D \mid p_0)$ is simply the expected value of that difference calculated with
respect to $p_D$.
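A minimal sketch of Definition 2.26 in use: beliefs are updated by Bayes' theorem and the amount of information is then computed from the initial and conditional distributions. The function names and all numerical values, including the likelihoods $P(D \mid E_j)$, are hypothetical choices for illustration. Data whose likelihood does not depend on $E_j$ leave the distribution unchanged and so carry (numerically) zero information.

```python
import math

def posterior(prior, likelihood):
    """Bayes' theorem: P(E_j | D) is proportional to P(D | E_j) * P(E_j)."""
    joint = [pr * lk for pr, lk in zip(prior, likelihood)]
    total = sum(joint)
    return [x / total for x in joint]

def information(prior, post):
    """I(D | p_0) = sum_j P(E_j | D) * log(P(E_j | D) / P(E_j)), natural logs."""
    return sum(pd * math.log(pd / p0) for pd, p0 in zip(post, prior) if pd > 0)

prior = [0.5, 0.3, 0.2]                            # initial distribution p_0
informative = posterior(prior, [0.9, 0.2, 0.1])    # hypothetical likelihoods P(D | E_j)
uninformative = posterior(prior, [0.4, 0.4, 0.4])  # likelihood constant in j

info_a = information(prior, informative)
info_b = information(prior, uninformative)
print(info_a, info_b)
```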

It should be clear from the preceding discussion that $I(D \mid p_0)$ measures
indirectly the information provided by the data in terms of the changes produced in the
probability distribution of interest. The amount of information is thus seen to be
a relative measure, which obviously depends on the initial distribution. Attempts
to define absolute measures of information have systematically failed to produce
concepts of lasting value.


In the finite case, the entropy of the distribution $p = \{p_1, \ldots, p_n\}$, defined by

\[
H\{p\} = - \sum_{j=1}^{n} p_j \log p_j,
\]

has been proposed and widely accepted as an absolute measure of uncertainty.
The recognised fact that its apparently natural extension to the continuous case
does not make sense (if only because it is heavily dependent on the particular
parametrisation used) should, however, have raised doubts about the universality
of this concept. The fact that, in the finite case, $H\{p\}$ as a measure of uncertainty
(and $-H\{p\}$ as a measure of “absolute” information) seems to work correctly is
explained (from our perspective) by the fact that

\[
\sum_{j=1}^{n} p_j \log \frac{p_j}{n^{-1}} = \log n - H\{p\},
\]

so that, in terms of the above discussion, $-H\{p\}$ may be interpreted, apart from an
unimportant additive constant, as the amount of information which is necessary
to obtain $p = \{p_1, \ldots, p_n\}$ from an initial discrete uniform distribution (see
Section 3.2.2), which acts as an “origin” or “reference” measure of uncertainty.
As we shall see in detail later, the problem of extending the entropy concept
to continuous distributions is closely related to that of defining an “origin” or
“reference” measure of uncertainty in the continuous case, a role unambiguously
played by the uniform distribution in the finite case. For detailed discussion of
$H\{p\}$ and other proposed entropy measures, see Renyi (1961).
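The identity relating entropy to the discrepancy from the uniform distribution can be confirmed numerically. In this sketch the function names and the distribution $p$ are assumptions chosen so that the arithmetic is easy to follow; natural logarithms are used throughout.

```python
import math

def entropy(p):
    """H{p} = -sum_j p_j * log(p_j)."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

def discrepancy_from_uniform(p):
    """sum_j p_j * log(p_j / (1/n)): information needed to reach p from the uniform."""
    n = len(p)
    return sum(pj * math.log(pj / (1.0 / n)) for pj in p if pj > 0)

p = [0.5, 0.25, 0.125, 0.125]
lhs = discrepancy_from_uniform(p)
rhs = math.log(len(p)) - entropy(p)
print(lhs, rhs)
```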


We shall on occasion wish to consider the idea of the amount of information
which may be expected from an experiment $e$, the expectation being calculated
before the results of the experiment are actually available.

Definition 2.27. (Expected information from an experiment). The expected
information to be provided by an experiment $e$ about a partition $\{E_j, j \in J\}$,
when the initial distribution over $\{E_j, j \in J\}$ is $p_0 = \{P(E_j), j \in J\}$, is
given by

\[
I(e \mid p_0) = \sum_{i \in I} I(D_i \mid p_0)\, P(D_i),
\]

where the possible results of the experiment $e$, $\{D_i, i \in I\}$, occur with
probabilities $\{P(D_i), i \in I\}$.


Proposition 2.32. An alternative expression for the expected information is

\[
I(e \mid p_0) = \sum_{i \in I} \sum_{j \in J} P(E_j \cap D_i) \log \frac{P(E_j \cap D_i)}{P(E_j)\, P(D_i)},
\]

where $P(E_j \cap D_i) = P(D_i)\, P(E_j \mid D_i)$, and $\{P(E_j \mid D_i), j \in J\}$ is the
conditional distribution, given the occurrence of $D_i$, corresponding to the
initial distribution $p_0 = \{P(E_j), j \in J\}$. Moreover, $I(e \mid p_0) \geq 0$, with
equality if and only if, for all $E_j$ and $D_i$, $P(E_j \cap D_i) = P(E_j)\, P(D_i)$.


Proof. Let $q_i = P(D_i)$, $p_j = P(E_j)$ and $p_{ji} = P(E_j \mid D_i)$. Then, by
Definition 2.27,

\[
I(e \mid p_0) = \sum_{i \in I} \Big\{ \sum_{j \in J} p_{ji} \log \frac{p_{ji}}{p_j} \Big\}\, q_i
= \sum_{i \in I} \sum_{j \in J} p_{ji}\, q_i \log \frac{p_{ji}\, q_i}{p_j\, q_i},
\]

and the result now follows from the fact that, by Bayes’ theorem,
$P(E_j \cap D_i) = P(E_j \mid D_i)\, P(D_i) = p_{ji}\, q_i$.

Since, by Proposition 2.31, $I(D_i \mid p_0) \geq 0$ with equality iff $P(E_j \mid D_i) = P(E_j)$,
it follows from Definition 2.27 that $I(e \mid p_0) \geq 0$ with equality if, and only if, for
all $E_j$ and $D_i$, $P(E_j \cap D_i) = P(E_j)\, P(D_i)$.


The expression for $I(e \mid p_0)$ given by Proposition 2.32 is Shannon's (1948)
measure of expected information. We have thus found, in a decision theoretical
framework, a natural interpretation of this famous measure of expected information:
Shannon's expected information is the expected utility provided by an experiment
in a pure inference context, when an individual's preferences are described by a
smooth, proper, local score function.
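The agreement between Definition 2.27 and the expression in Proposition 2.32 can be checked on a small example. The sketch below is illustrative: the function names are hypothetical, the experiment is described by an assumed matrix `lik[j][i]` $= P(D_i \mid E_j)$, and the numbers are made up. The second experiment, whose outcome probabilities do not depend on $E_j$, should yield zero expected information.

```python
import math

def expected_information_def(prior, lik):
    """Definition 2.27: sum_i I(D_i | p_0) * P(D_i), with lik[j][i] = P(D_i | E_j)."""
    total = 0.0
    for i in range(len(lik[0])):
        p_d = sum(pj * row[i] for pj, row in zip(prior, lik))        # P(D_i)
        post = [pj * row[i] / p_d for pj, row in zip(prior, lik)]    # P(E_j | D_i)
        info = sum(pd * math.log(pd / pj) for pd, pj in zip(post, prior) if pd > 0)
        total += info * p_d
    return total

def expected_information_joint(prior, lik):
    """Proposition 2.32: double sum over the joint probabilities P(E_j and D_i)."""
    marg = [sum(pj * row[i] for pj, row in zip(prior, lik)) for i in range(len(lik[0]))]
    total = 0.0
    for pj, row in zip(prior, lik):
        for i, l in enumerate(row):
            joint = pj * l
            if joint > 0:
                total += joint * math.log(joint / (pj * marg[i]))
    return total

prior = [0.6, 0.4]
lik = [[0.8, 0.2], [0.3, 0.7]]      # an informative experiment (hypothetical)
indep = [[0.5, 0.5], [0.5, 0.5]]    # outcomes independent of the partition

via_def = expected_information_def(prior, lik)
via_joint = expected_information_joint(prior, lik)
via_indep = expected_information_def(prior, indep)
print(via_def, via_joint, via_indep)
```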

In conclusion, we have suggested that the problem of reporting inferences
can be viewed as a particular decision problem and thus should be analysed within
the framework of decision theory. We have established that, with a natural
characterisation of an individual’s utility function when faced with a pure inference
problem, preferences should be described by a logarithmic score function. We
have also seen that, within this framework, discrepancy and amount of information
are naturally defined in terms of expected loss of utility and expected increase in
utility, respectively, and that maximising expected Shannon information is a
particular instance of maximising expected utility. We shall see in Section 3.4 how
these results, established here for finite partitions, extend straightforwardly to the
continuous case.


2.8 DISCUSSION AND FURTHER REFERENCES 
2.8.1 Operational Definitions 


In everyday conversation, the way in which we use language is typically rather in- 
formal and unselfconscious, and we tolerate each other’s ambiguities and vacuities 
for the most part, occasionally seeking an ad hoc clarification of a particular state- 
ment or idea if the context seems to justify the effort required in trying to be a little 
more precise. (For a detailed account of the ambiguities which plague qualitative 
probability expressions in English, see Mosteller and Youtz, 1990.) 

In the context of scientific and philosophical discourse, however, there is 
a paramount need for statements which are meaningful and unambiguous. The 
everyday, tolerant, ad hoc response will therefore no longer suffice. More rigorous 
habits of thought are required, and we need to be selfconsciously aware of the 
precautions and procedures to be adopted if we are to arrive at statements which 
make sense. 

A prerequisite for “making sense” is that the fundamental concepts which 
provide the substantive content of our statements should themselves be defined in 
an essentially unambiguous manner. We are thus driven to seek for definitions of 
fundamental notions which can be reduced ultimately to the touchstone of actual 
or potential personal experience, rather than remaining at the level of mere words 
or phrases. 

This kind of approach to definitions is closely related to the philosophy of 
pragmatism, as formulated in the second half of the nineteenth century by Peirce, 
who insisted that clarity in thinking about concepts could only be achieved by con- 
centrating attention on the conceivable practical effects associated with a concept, 
or the practical consequence of adopting one form of definition rather than another. 
In Peirce (1878), this point of view was summarised as follows: 


Consider what effects, that might conceivably have practical bearings, we
conceive the object of our conception to have. Then, our conception of these effects
is the whole of our conception of the object.

In some respects, however, this position is not entirely satisfactory in that it fails
to go far enough in elaborating what is to be understood by the term “practical”. This
crucial elaboration was provided by Bridgman (1927) in a book entitled The Logic
of Modern Physics, where the key idea of an operational definition is introduced
and illustrated by considering the concept of “length”:


... what do we mean by the length of an object? We evidently know what we
mean by length if we can tell what the length of any and every object is and for
the physicist nothing more is required. To find the length of an object, we have
to perform certain physical operations. The concept of length is therefore fixed
when the operations by which length is measured are fixed: that is, the concept
of length involves as much as, and nothing more, than the set of operations by
which length is determined. In general, we mean by any concept nothing more
than a set of operations: the concept is synonymous with the corresponding set
of operations. If the concept is physical, ... the operations are actual physical
measurements ... ; or if the concept is mental, ... the operations are mental
operations ...


Throughout this work, we shall seek to adhere to the operational approach to 
defining concepts in order to arrive at meaningful and unambiguous statements in 
the context of representing beliefs and taking actions in situations of uncertainty. 
Indeed, we have stressed this aspect of our thinking in Sections 2.1 to 2.7, where we 
made the practical, operational idea of preference between options the fundamental 
starting point and touchstone for all other definitions. 

We also noted the inevitable element of idealisation, or approximation, implicit
in the operational approach to our concepts, and we remarked on this at several
points in Section 2.3. Since many critics of the personalistic Bayesian viewpoint
claim to find great difficulty with this feature of the approach, often suggesting
that it undermines the entire theory, it is worth noting Bridgman’s very explicit
recognition that all experience is subject to error and that all we can do is to take
sufficient precautions when specifying sets of operations to ensure that remaining
unspecified variations in procedure have negligible effects on the results of interest.
This is well illustrated by Bridgman’s account of the operational concept of length
and its attendant idealisations and approximations:


... We take a measuring rod, lay it on the object so that one of its ends coincides
with one end of the object, mark on the object the position of the rod, then move
the rod along in a straight line extension of its previous position until the first
end coincides with the previous position of the second end, repeat this process as
often as we can, and call the length the total number of times the rod was applied.
This procedure, apparently so simple, is in practice exceedingly complicated, and
doubtless a full description of all the precautions that must be taken would fill a
large treatise. We must, for example, be sure that the temperature of the rod is
the standard temperature at which its length is defined, or else we must make a
correction for it; or we must correct for the gravitational distortion of the rod if
we measure a vertical length; or we must be sure that the rod is not a magnet or is
not subject to electrical forces ... we must go further and specify all the details
by which the rod is moved from one position to the next on the object, its precise
path through space and its velocity and acceleration in getting from one position
to another. Practically, of course, precautions such as these are not taken, but
the justification is in our experience that variations of procedure of this kind are
without effect on the final result ...


This pragmatic recognition that there are inevitable limitations in any concrete 
application of a set of operational procedures is precisely the spirit of our discussion 
of Axioms 4 and 5 in Section 2.3. In practical terms, we have to stop somewhere, 
even though, in principle, we could indefinitely refine our measurement operations. 
What matters is to be able to achieve sufficient accuracy to avoid unacceptable 
distortion in any analysis of interest. 


2.8.2 Quantitative Coherence Theories 


In a comprehensive review of normative decision theories leading to the expected
utility criterion, Fishburn (1981) lists over thirty different axiomatic formulations
of the principles of coherence, reflecting a variety of responses to the underlying
conflict between axiomatic simplicity and structural flexibility in the representation
of decision problems. Fishburn sums up the dilemma as follows:


On the one hand, we would like our axioms to be simple, interpretable, intu- 
itively clear, and capable of convincing others that they are appealing criteria 
of coherency and consistency in decision making under uncertainty, but to do 
this it seems essential to invoke strong structural conditions. On the other hand, 
we would like our theory to adhere to the loose structures that often arise in 
realistic decision situations, but if this is done then we will be faced with fairly 
complicated axioms that accommodate these loose structures. 


In addition, we should like the definitions of the basic concepts of probability 
and utility to have strong and direct links with practical assessment procedures, in 
conformity with the operational philosophy outlined above. 

With these considerations in mind, our purpose here is to provide a brief 
historical review of the foundational writings which seem to us the most significant. 
This will serve in part to acknowledge our general intellectual indebtedness and 
orientation, and in part to explain and further motivate our own particular choice 
of axiom system. 

The earliest axiomatic approach to the problem of decision making under
uncertainty is that of Ramsey (1926), who presented the outline of a formal system.
The key postulate in Ramsey’s theory is the existence of a so-called ethically neutral
event $E$, say, which, expressed in terms of our notation for options, has the property
that $\{c_1 \mid E,\, c_2 \mid E^c\} \sim \{c_1 \mid E^c,\, c_2 \mid E\}$, for any consequences $c_1$, $c_2$. It is then
rather natural to define the degree of belief in such an event to be 1/2 and, from
this quantitative basis, it is straightforward to construct an operational measure of
utility for consequences. This, in turn, is used to extend the definition of degree of
belief to general events by means of an expected utility model.

From a conceptual point of view, Ramsey’s theory seems to us, as indeed it 
has to many other writers, a revolutionary landmark in the history of ideas. From 
a mathematical point of view, however, the treatment is rather incomplete and it 
was not until 1954, with the publication of Savage’s (1954) book The Foundations 
of Statistics that the first complete formal theory appeared. No mathematical com- 
pletion of Ramsey's theory seems to have been published, but a closely related 
development can be found in Pfanzagl (1967, 1968). 

Savage’s major innovation in structuring decision problems is to define what 
he calls acts (options, in our terminology) as functions from the set of uncertain 
possible outcomes into the set of consequences. His key coherence assumption 
is then that of a complete, transitive order relation among acts and this is used 
to define qualitative probabilities. These are extended into quantitative probabili- 
ties by means of a “continuously divisible” assumption about events. Utilities are 
subsequently introduced using ideas similar to those of von Neumann and Mor- 
genstern (1944/1953), who had, ten years earlier, presented an axiom system for 
utility alone, assuming the prior existence of probabilities. 

The Savage axiom system is a great historical achievement and provides the 
first formal justification of the personalistic approach to probability and decision 
making; for a modern appraisal see Shafer (1986) and lively ensuing discussion. 
See, also, Hens (1992). Of course, many variations on an axiomatic theme are pos- 
sible and other Savage-type axiom systems have been developed since by Stigum 
(1972), Roberts (1974), Fishburn (1975) and Narens (1976). Suppes (1956) pre- 
sented a system which combined elements of Savage’s and Ramsey's approaches. 
See, also, Suppes (1960, 1974) and Savage (1970). There are, however, two major 
difficulties with Savage’s approach, which impose severe limitations on the range 
of applicability of the theory. 

The first of these difficulties stems from the “continuously divisible”
assumption about events, which Savage uses as the basis for proceeding from qualitative
to quantitative concepts. Such an assumption imposes severe constraints on the
allowable forms of structure for the set of uncertain outcomes: in fact, it even
prevents the theory from being directly applicable to situations involving a finite or
countably infinite set of possible outcomes.

One way of avoiding this embarrassing structural limitation is to introduce
a quantitative element into the system by a device like that of Ramsey's ethically
neutral event. This is directly defined to have probability 1/2 and thus enables
Ramsey to get the quantitative ball rolling without imposing undue constraints on
the structure. All he requires is that (at least) one such event be included in the
representation of the uncertain outcomes. In fact, a generalisation of Ramsey’s
idea re-emerges in the form of canonical lotteries, introduced by Anscombe and
Aumann (1963) for defining degrees of belief, and by Pratt, Raiffa and Schlaifer
(1964, 1965) as a basis for simultaneously quantifying personal degrees of belief
and utilities in a direct and intuitive manner.

The basic idea is essentially that of a standard measuring device, in some sense 
external to the real-world events and options of interest. It seems to us that this 
idea ties in perfectly with the kind of operational considerations described above, 
and the standard events and options that we introduced in Section 2.3 play this 
fundamental operational role in our own system. Other systems using standard 
measuring devices (sometimes referred to as external scaling devices) are those of 
Fishburn (1967b, 1969) and Balch and Fishburn (1974). A theory which, like ours, 
combines a standard measuring device with a fundamental notion of conditional 
preference is that of Luce and Krantz (1971). 

The second major difficulty with Savage’s theory, and one that also exists in 
many other theories (see Table I in Fishburn, 1981), is that the Savage axioms 
imply the boundedness of utility functions (an implication of which Savage was 
apparently unaware when he wrote The Foundations of Statistics, but which was 
subsequently proved by Fishburn, 1970). The theory does not therefore justify 
the use of many mathematically convenient and widely used utility functions; for 
example, those implicit in forms such as “quadratic loss” and “logarithmic score”. 

We take the view, already hinted at in our brief discussion of medical and 
monetary consequences in Section 2.5, that it is often conceptually and mathemat- 
ically convenient to be able to use structural representations going beyond what we 
perceive to be the essentially finitistic and bounded characteristics of real-world 
problems. And yet, in presenting the basic quantitative coherence axioms it is 
important not to confuse the primary definitions and coherence principles with the 
secondary issues of the precise forms of the various sets involved. For this reason, 
we have so far always taken options to be defined by finite partitions; indeed, within 
this simple structure, we hope that the essence of the quantitative coherence theory 
has already been clearly communicated, uncomplicated by structural complexities. 
Motivated by considerations of mathematical convenience, however, we shall, in 
Chapter 3, relax the constraint imposed on the form of the action space. We shall 
then arrive at a sufficiently general setting for all our subsequent developments and 
applications. 


2.8.3 Related Theories 


Our previous discussion centred on complete axiomatic approaches to decision
problems, involving a unified development of both probability and utility concepts.
In our view, a unified treatment of the two concepts is inescapable if operational
considerations are to be taken seriously. However, there have been a number of
attempted developments of probability ideas separate from utility considerations, as
well as separate developments of utility ideas presupposing the existence of
probabilities. In addition, there is a considerable literature on information-theoretic
ideas closely related to those of Section 2.7. In this section, we shall provide a
summary overview of a number of these related theories, grouped under the
following subheadings: (i) Monetary Bets and Degrees of Belief, (ii) Scoring Rules and
Degrees of Belief, (iii) Axiomatic Approaches to Degrees of Belief, (iv) Axiomatic
Approaches to Utilities and (v) Information Theories.

For the most part, we shall simply give what seem to us the most important
historical references, together with some brief comments. The first two topics will,
however, be treated at greater length; partly because of their close relation with
the main concerns of this book, and partly because of their connections with the
important practical topic of the assessment of beliefs.


Monetary Bets and Degrees of Belief 


An elegant demonstration that coherent degrees of belief satisfy the rules of (finitely
additive) probability was given by de Finetti (1937/1964), without explicit use of
the utility concept. Using the notation for options introduced in Section 2.3, de
Finetti’s approach can be summarised as follows.

If consequences are assumed to be monetary, and if, given an arbitrary
monetary sum $m$ and uncertain event $E$, an individual’s preferences among options are
such that $\{pm \mid \Omega\} \sim \{m \mid E,\, 0 \mid E^c\}$, then the individual's degree of belief in $E$ is
defined to be $p$.

This definition is virtually identical to Bayes’ own definition of probability
(see our later discussion under the heading of Axiomatic Approaches to Degrees of
Belief). In modern economic terminology, probability can be considered to be a
marginal rate of substitution or, more simply, a kind of “price”.

Given that an individual has specified his or her degrees of belief for some 
collection of events by repeated use of the above definition, either it is possible
to arrange a form of monetary bet in terms of these events which is such that the 
individual will certainly lose, a so-called “Dutch book”, or such an arrangement is 
impossible. In the latter case, the individual is said to have specified a coherent set 
of degrees of belief. It is now straightforward to verify that coherent degrees of 
belief have the properties of finitely additive probabilities. 

To demonstrate that $0 \leq p \leq 1$, for any $E$ and $m$, we can argue as follows. An
individual who assigns $p > 1$ is implicitly agreeing to pay a stake larger than $m$ to
enter a gamble in which the maximum prize he or she can win is $m$; an individual
who assigns $p < 0$ is implicitly agreeing to offer a gamble in which he or she will
pay out either $m$ or nothing in return for a negative stake, which is equivalent to
paying an opponent to enter such a gamble. In either case, a bet can be arranged
which will result in a certain loss to the individual and avoidance of this possibility
requires that $0 \leq p \leq 1$.
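
The argument can be checked with a few lines of arithmetic. The following is a minimal sketch rather than part of de Finetti's development: the function name `net_returns` and the numerical values are our own illustrative assumptions, and a negative prize is read as the individual being made to take the other side of the bet.

```python
def net_returns(p, m):
    """Net return to an individual who pays the stake p*m for a prize m on E.

    Returns (return if E occurs, return if E does not occur).
    A negative m means the individual is made to take the other side.
    """
    return (m - p * m, -p * m)


# p > 1: with a positive prize the individual loses whatever happens.
print(net_returns(1.2, 10.0))

# p < 0: with a negative prize the individual again loses whatever happens.
print(net_returns(-0.1, -10.0))

# 0 <= p <= 1: neither sign of m produces a loss in both cases.
print(net_returns(0.5, 10.0), net_returns(0.5, -10.0))
```

In the first two calls both components are negative, which is the certain loss described above; in the last two, one component is always positive.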

To demonstrate the additive property of degrees of belief for exclusive and
exhaustive events, $E_1, E_2, \ldots, E_n$, we proceed as follows. If an individual specifies
$p_1, p_2, \ldots, p_n$ to be his or her degrees of belief in those events, this is an implicit
agreement to pay a total stake of $p_1 m_1 + p_2 m_2 + \cdots + p_n m_n$, in order to enter a
gamble resulting in a prize of $m_i$ if $E_i$ occurs and thus a “gain”, or “net return”,
of $g_i = m_i - \sum_j p_j m_j$, which could, of course, be negative. In order to avoid the
possibility of the $m_j$’s being chosen in such a way as to guarantee the negativity
of the $g_i$’s for fixed $p_j$’s in this system of linear equations, it is necessary that the
determinant of the matrix relating the $m_j$’s to the $g_i$’s be zero so that the linear
system cannot be solved; this turns out to require that $p_1 + p_2 + \cdots + p_n = 1$.
Moreover, it is easy to check that this is also a sufficient condition for coherence:
it implies $\sum_i p_i g_i = 0$, for any choice of the $m_j$’s, and hence the impossibility of
all the returns being negative.
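
A small numerical sketch may help to fix ideas. It assumes, as our own illustration, that the opponent aims for the same return $-c$ whichever event occurs; solving $g_i = -c$ then gives equal stakes $m_i = c / (\sum_j p_j - 1)$, which exist only when the degrees of belief do not sum to one. The function names and numbers are hypothetical.

```python
def net_returns(p, m):
    """Gains g_i = m_i - sum_j p_j m_j, one for each event E_i that may occur."""
    total_stake = sum(pj * mj for pj, mj in zip(p, m))
    return [mi - total_stake for mi in m]


def dutch_book_stakes(p, loss=1.0, tol=1e-12):
    """Equal stakes giving a return of -loss whichever event occurs.

    Returns None when sum(p) equals 1 (up to tol): the linear system relating
    stakes to returns is then singular and no such stakes exist.
    """
    excess = sum(p) - 1.0
    if abs(excess) < tol:
        return None
    return [loss / excess] * len(p)


incoherent = [0.5, 0.4, 0.3]            # degrees of belief summing to 1.2
stakes = dutch_book_stakes(incoherent)
print(net_returns(incoherent, stakes))  # every entry is approximately -1

coherent = [0.5, 0.3, 0.2]              # degrees of belief summing to 1
g = net_returns(coherent, [7.0, -2.0, 4.0])
print(sum(pi * gi for pi, gi in zip(coherent, g)))  # approximately 0
```

For the coherent specification the weighted sum of the returns vanishes for the arbitrary stakes chosen, so the returns cannot all be negative.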

The extension of these ideas to cover the revision of degrees of belief
conditional on new information proceeds in a similar manner, except that an
individual’s degree of belief in an event $E$ conditional on an event $F$ is defined to
be the number $q$ such that, given any monetary sum $m$, we have the equivalence
$\{qm \mid \Omega\} \sim \{m \mid E \cap F,\ 0 \mid E^c \cap F,\ qm \mid F^c\}$, according to the individual’s
preference ordering among options. The interpretation of this definition is
straightforward: having paid a stake of $qm$, if $F$ occurs we are confronted with a gamble with
prizes $m$, if $E$ occurs, and nothing otherwise; if $F$ does not occur the bet is “called
off” and the stake returned.
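
The payoff structure of such a called-off bet can be written out directly. The sketch below is only an illustration of the definition; the function name `called_off_bet_net` and the values of $q$ and $m$ are hypothetical.

```python
def called_off_bet_net(q, m, e_occurs, f_occurs):
    """Net return on a bet on E conditional on F, with stake q*m and prize m."""
    if not f_occurs:
        return 0.0            # bet called off: the stake q*m is returned
    if e_occurs:
        return m - q * m      # prize m, less the stake
    return -q * m             # no prize: the stake is lost


q, m = 0.25, 100.0
for e_occurs, f_occurs in [(True, True), (False, True), (True, False), (False, False)]:
    print(e_occurs, f_occurs, called_off_bet_net(q, m, e_occurs, f_occurs))
```

The net return depends on $E$ only when $F$ occurs, which is what makes $q$ interpretable as a degree of belief in $E$ conditional on $F$.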

However, despite the intuitive appeal of this simple and neat approach, it has 
two major shortcomings from an operational viewpoint. 

In the first place, it is clear that the definitions cannot be taken seriously in 
terms of arbitrary monetary sums: the “perceived value” of a stake or a return is 
not equivalent to its monetary value and the missing “utility” concept is required in
order to overcome the difficulty. This point was later recognised by de Finetti (see 
Kyburg and Smokler, 1964/1980, p. 62, footnote (a)), but has its earlier origins 
in the celebrated St. Petersburg paradox (first discussed in terms of utility by 
Daniel Bernoulli, 1730/1954). For further discussion of possible forms of “utility 
for money”, see, for example, Pratt (1964), Lavalle (1968), Lindley (1971/1985, 
Chapter 5) and Hull et al. (1973). Additionally, one may explicitly recognise that
some people have a positive utility for gambling (see, for instance, Conlisk, 1993). 

An ad hoc modification of de Finetti’s approach would be to confine attention 
to “small” stakes (thus, in effect, restricting attention to a range of outcomes over 
which the “utility” can be taken as approximately linear) and the argument, thus 
modified, has considerable pedagogical and, perhaps, practical use, despite its rather 
informal nature. A more formal argument based on the avoidance of certain losses 
in betting formulations has been given by Freedman and Purves (1969). Related
arguments have also been used by Cornfield (1969), Heath and Sudderth (1972)
and Buehler (1976) to expand on de Finetti’s concept of coherent systems of bets.

In addition to the problem of “non-linearity in the face of risk”, alluded to
above, there is also the difficulty that unwanted game-theoretic elements may enter
the picture if we base a theory on ideas such as “opponents” choosing the levels of
prizes in gambles. For this reason, de Finetti himself later preferred to use an
approach based on scoring rules, a concept we have already introduced in Section 2.7.


Scoring Rules and Degrees of Belief 


The scoring rule approach to the definition of degrees of belief and the derivation of 
their properties when constrained to be coherent is due to de Finetti (1963, 1964),
with important subsequent generalisations by Savage (1971) and Lindley (1982a). 

In terms of the quadratic scoring rule, the development proceeds as follows.
Given an uncertain event $E$, an individual is asked to select a number, $p$, with
the understanding that if $E$ occurs he or she is to suffer a penalty (or loss) of
$L = (1 - p)^2$, whereas if $E$ does not occur he or she is to suffer a penalty of
$L = p^2$. Using the indicator function for $E$, the penalty can be written in the
general form, $L = (1_E - p)^2$. The number, $p$, which the individual chooses is
defined to be his or her degree of belief in $E$.
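
To see why the chosen number can reasonably be read as a degree of belief, it may help to compute the expected penalty. The sketch below assumes, purely for illustration, an individual who already holds a probability `belief` for $E$ and who chooses $p$ to minimise the expected quadratic penalty; the names and the value 0.3 are hypothetical.

```python
def expected_penalty(p, belief):
    """Expected quadratic penalty for reporting p when E has probability `belief`."""
    return belief * (1 - p) ** 2 + (1 - belief) * p ** 2


belief = 0.3                                   # hypothetical actual degree of belief
grid = [i / 100 for i in range(101)]           # candidate reports 0.00, 0.01, ..., 1.00
best = min(grid, key=lambda p: expected_penalty(p, belief))
print(best)
```

Since the expected penalty equals $(p - \pi)^2 + \pi(1 - \pi)$ when the belief is $\pi$, the minimising report is $p = \pi$: under this scoring scheme there is no incentive to report anything other than one's actual belief.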

Suppose now that $E_1, E_2, \ldots, E_n$ are an exclusive and exhaustive collection of
uncertain events for which the individual, using the quadratic scoring rule scheme,
has to specify degrees of belief $p_1, p_2, \ldots, p_n$, respectively, subject now to the
penalty

$$L = (1_{E_1} - p_1)^2 + (1_{E_2} - p_2)^2 + \cdots + (1_{E_n} - p_n)^2.$$

Given a specification, $p_1, p_2, \ldots, p_n$, either it is possible to find an alternative
specification, $q_1, q_2, \ldots, q_n$, say, such that

$$\sum_{i=1}^{n} (1_{E_i} - q_i)^2 < \sum_{i=1}^{n} (1_{E_i} - p_i)^2,$$

for any assignment of the value 1 to one of the $E_i$’s and 0 to the others, or it is
not possible to find such $q_1, q_2, \ldots, q_n$. In the latter case, the individual is said
to have specified a coherent set of degrees of belief. The underlying idea in this
development is clearly very similar to that of de Finetti’s (1937/1964) approach
where the avoidance of a “Dutch book” is the basic criterion of coherence.

A simple geometric argument now establishes that, for coherence, we must
have $0 \leq p_i \leq 1$, for $i = 1, 2, \ldots, n$, and $p_1 + p_2 + \cdots + p_n = 1$. To see this, note
that the $n$ logically compatible assignments of values 1 and 0 to the $E_i$’s define
$n$ points in $\mathbb{R}^n$. Thinking of $p_1, p_2, \ldots, p_n$ as defining a further point in $\mathbb{R}^n$, the
coherence condition can be reinterpreted as requiring that this latter point cannot
be moved in such a way as to reduce the distance from all the other $n$ points. This


2.8 Discussion and Further References 89 


means that $p_1, p_2, \ldots, p_n$ must define a point in the convex hull of the other $n$ points,
thus establishing the required result.
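
The geometric argument can be illustrated numerically. The sketch below is our own illustration, not part of de Finetti's argument: it takes an incoherent specification, moves it to the nearest point of the convex hull (the probability simplex) using a standard sort-based Euclidean projection, and compares the penalties under each possible outcome. Function names and numbers are hypothetical.

```python
def penalty(p, i):
    """Quadratic penalty when event E_i (0-based index i) is the one that occurs."""
    return sum(((1.0 if j == i else 0.0) - pj) ** 2 for j, pj in enumerate(p))


def project_to_simplex(p):
    """Euclidean projection of p onto {q : q_i >= 0, sum(q) = 1}."""
    u = sorted(p, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cumulative += uk
        t = (cumulative - 1.0) / k
        if uk - t > 0:
            theta = t          # keep the threshold from the last k that qualifies
    return [max(pj - theta, 0.0) for pj in p]


p = [0.5, 0.4, 0.3]            # incoherent: the values sum to 1.2
q = project_to_simplex(p)      # nearest coherent specification
for i in range(len(p)):
    print(i, penalty(p, i), penalty(q, i))
```

Whichever event occurs, the penalty under `q` is strictly smaller than under `p`, so the original specification was incoherent in the sense defined above.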

The extension of this approach to cover the revision of degrees of belief
conditional on new information proceeds as follows. An individual’s degree of belief
in an event $E$ conditional on the occurrence of an event $F$ is defined to be the
number $q$ which he or she chooses when confronted with a penalty defined by
$L = 1_F (1_E - q)^2$. The interpretation of this penalty is straightforward. Indeed,
if $F$ occurs, the specification of $q$ proceeds according to the penalty $(1_E - q)^2$;
if $F$ does not occur, there is no penalty, a formulation which is clearly related to
the idea of “called-off” bets used in de Finetti’s 1937 approach. Suppose now
that, in addition to the conditional degree of belief $q$, the numbers $p$ and $r$ are the
individual’s degrees of belief, respectively, for the events $E \cap F$ and $F$, specified
subject to the penalty

$$L = 1_F (1_E - q)^2 + (1_E 1_F - p)^2 + (1_F - r)^2.$$

To derive the constraints on p, q and r imposed by coherence, which de- 
mands that no other choices will lead to a strictly smaller L, whatever the logically 
compatible outcomes of the events are, we argue as follows. 

If $u$, $v$, $w$, respectively, are the values which $L$ takes in the cases where $E \cap F$,
$E^c \cap F$ and $F^c$ occur, then $p$, $q$, $r$ satisfy the equations

$$u = (1 - q)^2 + (1 - p)^2 + (1 - r)^2$$
$$v = q^2 + p^2 + (1 - r)^2$$
$$w = p^2 + r^2.$$

If $p$, $q$, $r$ defined a point in $\mathbb{R}^3$ where the Jacobian of the transformation defined
by the above equations did not vanish, it would be possible to move from that point
in a direction which simultaneously reduced the values $u$, $v$ and $w$. Coherence
therefore requires that the Jacobian be zero. A simple calculation shows that this
reduces to the condition $q = p/r$, which is, again, Bayes’ theorem.
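
The “simple calculation” can be verified numerically. The sketch below is our own check, under the assumption that the three penalties are $u = (1-q)^2 + (1-p)^2 + (1-r)^2$, $v = q^2 + p^2 + (1-r)^2$ and $w = p^2 + r^2$: differentiating with respect to $(p, q, r)$ and expanding the determinant gives $8(p - qr)$, which vanishes exactly when $q = p/r$. The function names are hypothetical.

```python
def jacobian(p, q, r):
    """Partial derivatives of (u, v, w) with respect to (p, q, r)."""
    return [
        [-2 * (1 - p), -2 * (1 - q), -2 * (1 - r)],   # row for u
        [2 * p, 2 * q, -2 * (1 - r)],                 # row for v
        [2 * p, 0.0, 2 * r],                          # row for w
    ]


def det3(a):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


# Incoherent choice: q differs from p / r, and the Jacobian does not vanish.
print(det3(jacobian(0.2, 0.5, 0.6)), 8 * (0.2 - 0.5 * 0.6))

# Coherent choice: q = p / r, and the Jacobian is (numerically) zero.
print(det3(jacobian(0.3, 0.5, 0.6)))
```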

De Finetti’s ‘penalty criterion’ and related ideas have been critically re-exam- 
ined by a number of authors. Relevant additional references are Myerson (1979), 
Regazzini (1983), Gatsonis (1984), Eaton (1992) and Gilio (1992a). See, also, 
Piccinato (1986). 


Axiomatic Approaches to Degrees of Belief 


Historically, the idea of probability as “degree of belief” has received a great deal of 
distinguished support, including contributions from James Bernoulli (1713/1899), 
Laplace (1774/1986, 1814/1952), De Morgan (1847) and Borel (1924/1964). How- 
ever, so far as we know, none of these writers attempted an axiomatic development 
of the idea. 

The first recognisably “axiomatic” approach to a theory of degrees of belief 
was that of Bayes (1763) and the magnitude of his achievement has been clearly
recognised in the two centuries following his death by the adoption of the adjective
Bayesian as a description of the philosophical and methodological developments 
which have been inspired, directly or indirectly, by his essay. 

By present day standards, Bayes’ formulation is, of course, extremely infor- 
mal, and a more formal, modern approach only began to emerge a century and a 
half later, in a series of papers by Wrinch and Jeffreys (1919, 1921). Formal axiom 
systems which whole-heartedly embrace the principle of revising beliefs through 
systematic use of Bayes’ theorem, are discussed in detail by Jeffreys (1931/1973, 
1939/1961), whose profound philosophical and methodological contributions to 
Bayesian statistics are now widely recognised: see, for example, the evaluations
of his work by Geisser (1980a), by Good (1980a) and by Lindley (1980a), in the 
volume edited by Zellner (1980). 

From a foundational perspective, however, the flavour of Jeffreys’ approach 
seems to us to place insufficient emphasis on the inescapably personal nature of de- 
grees of belief, resulting in an over-concentration on “conventional” representations 
of degrees of belief derived from “logical” rather than operational considerations 
(despite the fact that Jeffreys was highly motivated by real world applications!).
Similar criticisms seem to us to apply to the original and elegant formal develop- 
ment given by Cox (1946, 1961) and Jaynes (1958), who showed that the probability 
axioms constitute the only consistent extension of ordinary (Aristotelian) logic in 
which degrees of belief are represented by real numbers. 

We should point out, however, that our emphasis on operational considerations 
and the subjective character of degrees of belief would, in turn, be criticised by 
many colleagues who, in other respects, share a basic commitment to the Bayesian 
approach to statistical problems. See Good (1965, Chapter 2) for a discussion of the 
variety of attitudes to probability compatible with a systematic use of the Bayesian 
paradigm. 

There are, of course, many other examples of axiomatic approaches to quanti- 
fying uncertainty in some form or another. In the finite case, this includes work by 
Kraft et al. (1959), Scott (1964), Fishburn (1970, Chapter 4), Krantz et al. (1971),
Domotor and Stelzer (1971), Suppes and Zanotti (1976, 1982), Heath and Sudderth
(1978) and Luce and Narens (1978). The work of Keynes (1921/1929) and Car- 
nap (1950/1962) deserves particular mention and will be further discussed later in 
Section 2.8.4. Fishburn (1986) provided an authoritative review of the axiomatic 
foundations of subjective probability, which is followed by a long, stimulating 
discussion. See, also, French (1982) and Chuaqui and Malitz (1983). 


Axiomatic Approaches to Utilities 


Assuming the prior existence of probabilities, von Neumann and Morgenstern
(1944/1953) presented axioms for coherent preferences which led to a justification 
of utilities as numerical measures of value for consequences and to the optimality 
criterion of maximising expected utility. Much of Savage’s (1954/1972) system
was directly inspired by this seminal work of von Neumann and Morgenstern and
the influence of their ideas extends into a great many of the systems we have men- 
tioned. Other early developments which concentrate on the utility aspects of the 
decision problem include those of Friedman and Savage (1948, 1952), Marschak 
(1950), Arrow (1951a), Herstein and Milnor (1953), Edwards (1954) and Debreu 
(1960). Seminal references are reprinted in Page (1968). General accounts of 
utility are given in the books by Blackwell and Girshick (1954), Luce and Raiffa 
(1957), Chernoff and Moses (1959) and Fishburn (1970). Extensive bibliographies 
are given in Savage (1954/1972) and Fishburn (1968, 1981). 

Discussions of the experimental measurement of utility are provided by Edwards
(1954), Davidson et al. (1957), Suppes and Walsh (1959), Becker et al. (1963),
DeGroot (1963), Becker and McClintock (1967), Savage (1971) and Hull et al.
(1973). DeGroot (1970, Chapter 7) presents a general axiom system for utilities
which imposes rather few mathematical constraints on the underlying decision 
problem structure. Multiattribute utility theory is discussed, among others, by 
Fishburn (1964) and Keeney and Raiffa (1976). Other discussions of utility theory 
include Fishburn (1967a, 1988b) and Machina (1982, 1987). See, also, Schervish 
et al. (1990). 


Information Theories 


Measures of information are closely related to ideas of uncertainty and probability 
and there is a considerable literature exploring the connections between these topics.

The logarithmic information measure was proposed independently by Shannon 
(1948) and Wiener (1948) in the context of communication engineering; Lindley 
(1956) later suggested its use as a statistical criterion in the design of experiments. 
The logarithmic divergency measure was first proposed by Kullback and Leibler 
(1951) and was subsequently used as the basis for an information-theoretic approach 
to statistics by Kullback (1959/1968). A formal axiomatic approach to measures 
of information in the context of uncertainty was provided by Good (1966), who 
has made numerous contributions to the literature of the foundations of decision 
making and the evaluation of evidence. Other relevant references on information 
concepts are Renyi (1964, 1966, 1967) and Sarndal (1970). 

The mathematical results which lead to the characterisation of the logarithmic 
scoring rule for reporting probability distributions have been available for some 
considerable time. Logarithmic scores seem to have been first suggested by Good 
(1952), but he only dealt with dichotomies, for which the uniqueness result is not 
applicable. The first characterisation of the logarithmic score for a finite distribution
was attributed to Gleason by McCarthy (1956); Aczel and Pfanzagl (1966), Arimoto
(1970) and Savage (1971) have also given derivations of this form of scoring rule 
under various regularity conditions. 

By considering the inference reporting problem as a particular case of a decision
problem, we have provided (in Section 2.7) a natural, unifying account of
the fundamental and close relationship between information-theoretic ideas and
the Bayesian treatment of “pure inference” problems. Based on work of Bernardo
(1979a), this analysis will be extended, in Chapter 3, to cover continuous
distributions.


2.8.4 Critical issues 


We shall conclude this chapter by providing a summary overview of our position 
in relation to some of the objections commonly raised against the foundations of 
Bayesian statistics. These will be dealt with under the following subheadings: (i)
Dynamic Frame of Discourse, (ii) Updating Subjective Probability, (iii) Relevance
of an Axiomatic Approach, (iv) Structure of the Set of Relevant Events, (v)
Prescriptive Nature of the Axioms, (vi) Precise, Complete, Quantitative Preference,
(vii) Subjectivity of Probability, (viii) Statistical Inference as a Decision Problem
and (ix) Communication and Group Decision Making.


Dynamic Frame of Discourse 


As we indicated in Chapter 1, our concern in this volume is with coherent beliefs
and actions in relation to a limited set of specified possibilities, currently assumed 
necessary and sufficient to reflect key features of interest in the problem under 
study. In the language of Section 2.2, we are operating in terms of a fixed frame 
of discourse, defined in the light of our current knowledge and assumptions, $M_0$.
However, as many critics have pointed out, this activity constitutes only one static 
phase of the wider, evolving, scientific learning and decision process. In the more 
general, dynamic, context, this activity has to be viewed, either potentially or 
actually, as sandwiched between two other vital processes. On the one hand, the 
creative generation of the set of possibilities to be considered; on the other hand, the 
critical questioning of the adequacy of the currently entertained set of possibilities 
(see, for example, Box, 1980). We accept that the mode of reasoning encapsulated 
within the quantitative coherence theory as presented here is ultimately conditional, 
and thus not directly applicable to every phase of the scientific process. But we 
do not accept, as Box (1980) appears to, that alternative formal statistical theories 
have a convincing, complementary role to play. 

The problem of generating the frame of discourse, i.e., inventing new models
or theories, seems to us to be one which currently lies outside the purview of
any “statistical” formalism, although some limited formal clarification is actually
possible within the Bayesian framework, as we shall see in Chapter 4. Substantive
subject-matter inputs would seem to be of primary importance, although informal,
exploratory data analysis is no doubt a necessary adjunct and, particularly
in the context of the possibilities opened up by modern computer graphics, offers
considerable intellectual excitement and satisfaction in its own right.

The problem of criticising the frame of discourse also seems to us to remain
essentially unsolved by any “statistical” theory. In the case of a “revolution”, or
even “rebellion”, in scientific paradigm (Kuhn, 1962), the issue is resolved for
us as statisticians by the consensus of the subject-matter experts, and we simply
begin again on the basis of the frame of discourse implicit in the new paradigm.
However, in the absence of such “externally” directed revision or extension of the
current frame of discourse, it is not clear what questions one should pose in order
to arrive at an “internal” assessment of adequacy in the light of the information thus
far available.

On the one hand, exploratory diagnostic probing would seem to have a role to 
play in confirming that specific forms of local elaboration of the frame of discourse 
should be made. The logical catch here, however, is that such specific diagnostic 
probing can only stem from the prior realisation that the corresponding specific 
elaborations might be required. The latter could therefore be incorporated ab initio 
into the frame of discourse and a fully coherent analysis carried out. The issue 
here is one of pragmatic convenience, rather than of circumscribing the scope of 
the coherent theory. 

On the other hand, the issue of assessing adequacy in relation to a total absence 
of any specific suggested elaborations seems to us to remain an open problem. 
Indeed, it is not clear that the “problem” as usually posed is well-formulated. For 
example, is the key issue that of “surprise”; or is some kind of extension of the 
notion of a decision problem required in order to give an operational meaning to 
the concept of “adequacy”? 

Readers interested in this topic will find in Box (1980), and the ensuing dis- 
cussion, a range of reactions. We shall return to these issues in Chapter 6. Related 
issues arise in discussions of the general problem of assessing, or “calibrating”, 
the external, empirical performance of an internally coherent individual; see, for 
example, Dawid (1982a). 

Overall, our responses to critics who question the relevance of the coherent 
approach based on a fixed frame of reference can be summarised as follows. So 
far as the scope and limits of Bayesian theory are concerned: (i) we acknowledge 
that the mode of reasoning encapsulated within the quantitative coherence theory 
is ultimately conditional, and thus not directly applicable to every phase of the 
scientific process; (ii) informal, exploratory techniques are an essential part of the 
process of generating ideas; there can be no purely “statistical” theory of model 
formulation; this aspect of the scientific process is not part of the foundational 
debate, although the process of passing from such ideas to their mathematical 
representation can often be subjected to formal analysis; (iii) we all lack a decent 
theoretical formulation of and solution to the problem of global model criticism in 
the absence of concrete suggested alternatives. 

However, critics of the Bayesian approach should recognise that: (i) an enormous
amount of current theoretical and applied statistical activity is concerned
with the analysis of uncertainty in the context of models which are accepted, for
the purposes of the analysis, as working frames of discourse, subject only to local
probing of specific potential elaborations, and (ii) our arguments thus far, and those
to follow, are an attempt to convince the reader that within this latter context there
are compelling reasons for adopting the Bayesian approach to statistical theory and
practice.


Updating Subjective Probability


An issue related to the topic just discussed is that of the mechanism for updating
subjective probabilities.

In Section 2.4.2, we defined, in terms of a conditional uncertainty relation,
$\leq_G$, the notion of the conditional probability, $P(E \mid G)$, of an event $E$ given the
assumed occurrence of an event $G$. From this, we derived Bayes’ theorem, which
establishes that $P(E \mid G) = P(G \mid E)P(E)/P(G)$. If we actually know for certain
that $G$ has occurred, $P(E \mid G)$ becomes our actual degree of belief in $E$. The prior
probability $P(E)$ has been updated to the posterior probability $P(E \mid G)$.
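
As a concrete illustration of the updating step, the following minimal sketch computes $P(E \mid G)$ from a prior and two likelihoods, obtaining $P(G)$ by the law of total probability. The function name and the numerical values are hypothetical and are not taken from the text.

```python
def posterior(prior_e, lik_g_given_e, lik_g_given_not_e):
    """P(E | G) via Bayes' theorem, with P(G) from the law of total probability."""
    p_g = lik_g_given_e * prior_e + lik_g_given_not_e * (1.0 - prior_e)
    return lik_g_given_e * prior_e / p_g


# Hypothetical values: P(E) = 0.01, P(G | E) = 0.95, P(G | not E) = 0.05.
print(posterior(0.01, 0.95, 0.05))
```

With these values the prior of 0.01 is revised upwards to roughly 0.16; whether this conditional assessment should be identified with the actual belief once $G$ is known is the question taken up next.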

However, a number of authors have questioned whether it is justified to identify 
assessments made conditional on the assumed occurrence of G with actual beliefs 
once G is known. We shall not pursue this issue further, although we acknowledge 
its interest and potential importance. Detailed discussion and relevant references 
can be found in Diaconis and Zabell (1982), who discuss, in particular, Jeffrey’s
rule (Jeffrey, 1965/1983), and Goldstein (1985), who examines the role of temporal
coherence. See, also, Good (1977).


Relevance of the Axiomatic Approach 


Arguments against over-concern with foundational issues come in many forms.
At one extreme, we have heard Bayesian colleagues argue that the mechanics and
flavour of the Bayesian inference process have their own sufficient, direct, intuitive
appeal and do not need axiomatic reinforcement. Another form of this argument 
asserts that developments from axiom systems are “pointless” because the con- 
clusions are, tautologically, contained in the premises. Although this is literally 
true, we simply do not accept that the methodological imperatives which flow from 
the assumptions of quantitative coherence are in any way “obvious” to someone 
contemplating the axioms. At the other extreme, we have heard proponents of
supposedly “model-free” exploratory methodology proclaim that we can evolve
towards “good practice” by simply giving full encouragement to the creative
imagination and then “seeing what works”.

Our objection to both these attitudes is that they each implicitly assume, albeit
from different perspectives, the existence of a commonly agreed notion of what
constitutes “desirable statistical practice”. This does not seem to us a reasonable
assumption at all and, to avoid potential confusion, an operational definition of the
notion is required. The quantitative coherence approach is based on the assumption
that, within the structured framework set out in Section 2.2, desirable practice
requires, at least, the avoidance of Dutch-book inconsistencies, an assumption which
leads to the Bayesian paradigm for the revision of belief.


Structure of the Set of Relevant Events 


But is the structure assumed for the set of relevant events too rigid? In particular, 
is it reasonable to assume that, in each and every context involving uncertainty, 
the logical description of the possibilities should be forced into the structure of 
an algebra (or σ-algebra), in which each event has the same logical status? It
seems to us that this may not always be reasonable and that there is a potential 
need for further research into the implications of applying appropriate concepts of 
quantitative coherence to event structures other than simple algebras. For example, 
this problem has already been considered in relation to the foundations of quantum 
mechanics, where the notion of “sample space” has been generalised to allow for 
the simultaneous representation of the outcomes of a set of “related” experiments 
(see, for example, Randall and Foulis, 1975). In that context, it has been established 
that there exists a natural extension of the Bayesian paradigm to the more general 
setting. 

Another area where the applicability of the standard paradigm has been ques- 
tioned is that of so-called “knowledge-based expert systems”, which often operate 
on knowledge representations which involve complex and loosely structured spaces 
of possibilities, including hierarchies and networks. Proponents of such systems 
have argued that (Bayesian) probabilistic reasoning is incapable of analysing these 
structures and that novel forms of quantitative representations of uncertainty are 
required (see Spiegelhalter and Knill-Jones, 1984, and ensuing discussion, for refer- 
ences to these ideas). However, alternative proposals, which include “fuzzy logic”, 
“belief functions” and “confirmation theory”, are, for the most part, ad hoc and 
the challenge to the probabilistic paradigm seems to us to be elegantly answered 
by Lauritzen and Spiegelhalter (1988). We shall return to this topic later in this 
section. 

Finally, another form of query relating to the logical status of events is some- 
times raised (see, for example, Barnard, 1980a). This draws attention to the in- 
terpretational asymmetry between a statement like “the underlying distribution is 
normal” and its negation. This raises questions about their implicitly symmetric 
treatment within the framework given in Section 2.2. Choices of the elements to 
be included in $\mathcal{E}$ are, of course, bound up with general questions of “modelling”
and the issue here seems to us to be one concerning sensible modelling strategies. 
We shall return to this topic in Chapters 4 and 6. 


Prescriptive Nature of the Axioms 


When introducing our formal development, we emphasised that the Bayesian
foundational approach is prescriptive and not descriptive. We are concerned with
understanding how we ought to proceed, if we wish to avoid a specified form of
behavioural inconsistency. We are not concerned with sociological or psychological
description of actual behaviour. For the latter, see, for example, Wallsten (1974),
Kahneman and Tversky (1979), Kahneman et al. (1982), Machina (1987), Bordley
(1992), Luce (1992) and Yilmaz (1992). See, also, Savage (1980).

Despite this, many critics of the Bayesian approach have somehow taken 
comfort from the fact that there is empirical evidence, from experiments involving 
hypothetical gambles, which suggests that people often do not act in conformity 
with the coherence axioms: see, for example, Allais (1953) and Ellsberg (1961).

Allais’ criticism is based on a study of the actual preferences of individuals 
in contexts where they are faced with pairs of hypothetical situations, like those 
described in Figure 2.8, in each of which a choice has to be made between the two
options, where $C$ stands for current assets and the numbers describe thousands of
units of a familiar currency. 


[Figure: decision trees for Situation 1 (options 1 and 2) and Situation 2 (options 3
and 4), with consequences C, 500 + C and 2500 + C]

Figure 2.8 An illustration of Allais’ paradox


It has been found (see, for example, Allais and Hagen, 1979) that there are a
great many individuals who prefer option 1 to option 2 in the first situation, and at
the same time prefer option 4 to option 3 in the second situation.

To examine the coherence of these two revealed preferences, we note that, 
if they are to correspond to a consistent utility ordering, there must exist a utility
function u(.), defined over consequences (in this case, total assets in thousands of
monetary units), satisfying the inequalities


$$u(500 + C) > 0.10\,u(2500 + C) + 0.89\,u(500 + C) + 0.01\,u(C)$$


$$0.10\,u(2500 + C) + 0.90\,u(C) > 0.11\,u(500 + C) + 0.89\,u(C).$$


But simple rearrangement reveals that these inequalities are logically incom- 
patible for any function u(.), and, therefore, the stated preferences are incoherent. 
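
The rearrangement can be made explicit: subtracting $0.89\,u(500 + C)$ from both sides of the first inequality and $0.89\,u(C)$ from both sides of the second shows that the two expected-utility differences are exact negatives of one another. The sketch below checks this for a handful of randomly generated utility values; the function name and the sampling scheme are our own illustrative assumptions.

```python
import random


def eu_differences(u_c, u_500, u_2500):
    """Expected-utility differences implied by the two stated preferences.

    The arguments are the utilities of C, 500 + C and 2500 + C.
    Returns (EU(option 1) - EU(option 2), EU(option 4) - EU(option 3)).
    """
    d12 = u_500 - (0.10 * u_2500 + 0.89 * u_500 + 0.01 * u_c)
    d43 = (0.10 * u_2500 + 0.90 * u_c) - (0.11 * u_500 + 0.89 * u_c)
    return d12, d43


random.seed(0)
for _ in range(5):
    # Hypothetical increasing utilities for the three consequences.
    u_c, u_500, u_2500 = sorted(random.random() for _ in range(3))
    d12, d43 = eu_differences(u_c, u_500, u_2500)
    # The two differences sum to zero, so they cannot both be positive.
    print(round(d12, 6), round(d43, 6))
```

Preferring option 1 to option 2 requires the first difference to be positive, and preferring option 4 to option 3 requires the second to be positive; since each is the negative of the other, no utility function can deliver both.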

How should one react to this conflict between the compelling intuitive attrac- 
tion (for many individuals) of the originally stated preferences, and the realisation 
that they are not in accord with the prescriptive requirements of the formal theory? 
Allais and his followers would argue that the force of examples of this kind is so 
powerful that it undermines the whole basis of the axiomatic approach set out in 
Section 2.3. This seems to us a very peculiar argument. It is as if one were to argue
for the abandonment of ordinary logical or arithmetic rules, on the grounds that 
individuals can often be shown to perform badly at deduction or long division. 

The conclusion to be drawn is surely the opposite: namely, the more liable 
people are to make mistakes, the more need there is to have the formal prescription 
available, both as a reference point, to enable us to discover the kinds of mistakes 
and distortions to which we are prone in ad hoc reasoning, and also as a suggestive 
source of improved strategies for thinking about and structuring problems. 


Table 2.4 Savage’s reformulation of Allais’ example


                           Ticket number
                           1          2-11        12-100

Situation 1   option 1     500 + C    500 + C     500 + C
              option 2     C          2500 + C    500 + C

Situation 2   option 3     500 + C    500 + C     C
              option 4     C          2500 + C    C


In the case of Allais’ example, Savage (1954/1972, Chapter 5) pointed out 
that a concrete realisation of the options described in the two situations could be 
achieved by viewing the outcomes as prizes from a lottery involving one hundred 
numbered tickets, as shown in Table 2.4. Indeed, when the problem is set out in 
this form, it is clear that if any of the tickets numbered from 12 to 100 is chosen it 
will not matter, in either situation, which of the options is selected. Preferences in 
both situations should therefore only depend on considerations relating to tickets in 
the range from 1 to 11. But, for this range of tickets, situations 1 and 2 are identical 
in structure, so that preferring option 1 to option 2 and at the same time preferring 
option 4 to option 3 is now seen to be indefensible. 

Viewed in this way, Allais’ problem takes on the appearance of a decision- 
theoretic version of an “optical illusion” achieved through the distorting effects of 
“extreme” consequences, which go far beyond the ranges of our normal experience. 
The lesson of Savage's analysis is that, when confronted with complex or tricky 
problems, we must be prepared to shift our angle of vision in order to view the 
structure in terms of more concrete and familiar images with which we feel more 
comfortable. 
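
Savage's ticket argument can be checked mechanically. The sketch below transcribes the prizes of Table 2.4 (in thousands, over and above current assets C) and confirms the two structural facts used in the argument; the names `OPTIONS`, `lottery` and `restricted` are illustrative choices, not part of the original text.

```python
from fractions import Fraction

# Prizes by ticket range, transcribed from Table 2.4:
# ticket 1, tickets 2-11, tickets 12-100.
TICKET_COUNTS = (1, 10, 89)
OPTIONS = {
    1: (500, 500, 500),
    2: (0, 2500, 500),
    3: (500, 500, 0),
    4: (0, 2500, 0),
}


def lottery(option):
    """Return {prize: probability} when one of the 100 tickets is drawn."""
    dist = {}
    for count, prize in zip(TICKET_COUNTS, OPTIONS[option]):
        dist[prize] = dist.get(prize, Fraction(0)) + Fraction(count, 100)
    return dist


def restricted(option):
    """Prizes on tickets 1-11 only (the first two ticket ranges)."""
    return OPTIONS[option][:2]


# Tickets 12-100 pay the same whichever option is chosen within a situation.
same_tail_situation_1 = OPTIONS[1][2] == OPTIONS[2][2]
same_tail_situation_2 = OPTIONS[3][2] == OPTIONS[4][2]

# On tickets 1-11 the two situations offer exactly the same pair of choices.
identical_on_1_to_11 = (restricted(1) == restricted(3)
                        and restricted(2) == restricted(4))

if __name__ == "__main__":
    for k in OPTIONS:
        print(k, lottery(k))
    print(same_tail_situation_1, same_tail_situation_2, identical_on_1_to_11)
```

The function `lottery` also recovers the probabilities that appear in the two inequalities (0.10, 0.89, 0.01 for option 2, and so on), so the table and the algebraic formulation describe the same gambles.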

Ellsberg’s (1961) criticism is of a similar kind to Allais’, but the “distorting” 
elements which are present in his hypothetical gambles stem from the rather vague 
nature of the uncertainty mechanisms involved, rather than from the extreme nature 
of the consequences. In such cases, where confusion is engendered by the proba- 
bilities rather than the utilities, the perceived incoherence may, in fact, disappear 
if one takes into account the possibility that the experimental subjects’ utility may 
be a function of more than one attribute. In particular, we may need to consider 
the attribute “avoidance of looking foolish”, often as a result of thinking that there 
is a “right answer” if the problem seems predominantly to do with sorting out 
“experimentally assigned” probabilities, in addition to the monetary consequences 
specified in the hypothetical gambles. Even without such refinements, however, 
and arguing solely in terms of the gambles themselves, Raiffa (1961) and Roberts 
(1963) have provided clear and convincing rejoinders to the Ellsberg criticism. In- 
deed, Roberts presents a particularly lucid and powerful defence of the axioms, 
also making use of the analogy with “optical” and “magical” illusions. The form 
of argument used is similar to that in Savage's rejoinder to Allais, and we shall 
not repeat the details here. For a recent discussion of both the Allais and Ellsberg 
phenomena, see Kadane (1992). 


Precise, Complete, Quantitative Preferences 


In our axiomatic development we have not made the a priori assumption that all 
options can be compared directly using the preference relation. We have, how- 
ever, assumed, in Axiom 5, that all consequences and certain general forms of 
dichotomised options can be compared with dichotomised options involving stan- 
dard events. This latter assumption then turns out to imply a quantitative basis for 
all preferences, and hence for beliefs and values. 

The view has been put forward by some writers (e.g. Keynes, 1921/1929, and 
Koopman, 1940) that not all degrees of belief are quantifiable, or even comparable. 
However, beginning with Jeffreys’ review of Keynes’ Treatise (see also Jeffreys, 
1931/1973) the general response to this view has been that some form of quantifica- 
tion is essential if we are to have an operational, scientifically useful theory. Other 
references, together with a thorough review of the mathematical consequences of 
these kinds of assumptions, are given by Fine (1973, Chapter 2). 

Nevertheless, there has been a widespread feeling that the demand for precise 
quantification, implicit in “standard” axiom systems, is rather severe and certainly 
ought to be questioned. We should consider, therefore, some of the kinds of sug- 
gestions that have been put forward from this latter perspective. 

Among the attempts to present formal alternatives to the assumption of pre- 
cise quantification are those of Good (1950, 1962), Kyburg (1961), Smith (1961), 
Dempster (1967, 1985), Walley and Fine (1979), Girón and Rios (1980), DeRober- 
tis and Hartigan (1981), Walley (1987, 1991) and Nakamura (1993). In essence, 
the suggestion in relation to probabilities is to replace the usual representation of 
a degree of belief in terms of a single number, by an interval defined by two num- 
bers, to be interpreted as “upper” and “lower” probabilities. So far as decisions are 
concerned, such theories lead to the identification of a class of “would-be” actions, 
but provide no operational guidance as to how to choose from among these. Par- 
ticular ideas, such as Dempster’s (1968) generalization of the Bayesian inference 
mechanism, have been shown to be suspect (see, for example, Aitchison, 1968), but 
have led on themselves to further generalizations, such as Shafer’s (1976, 1982a) 
theory of “belief functions”. This has attracted some interest (see, e.g., Wasserman, 
1990a, 1990b), but its operational content has thus far eluded us. 

In general, we accept that the assumption of precise quantification, i.e., that 
comparisons with standard options can be successively refined without limit, is 
clearly absurd if taken literally and interpreted in a descriptive sense. We therefore 
echo our earlier detailed commentary on Axiom 5 in Section 2.3, to the effect 
that these kinds of proposed extension of the axioms seem to us to be based on 
a confusion of the descriptive and the prescriptive and to be largely unnecessary. 
It is rather as though physicists and surveyors were to feel the need to rethink 
their practices on the basis of a physical theory incorporating explicit concepts of 
upper and lower lengths. We would not wish, however, to be dogmatic about this. 
Our basic commitment is to quantitative coherence. The question of whether this 
should be precise, or allowed to be imprecise, is certainly an open, debatable one, 
and it might well be argued that “measurement” of beliefs and values is not totally 
analogous to that of physical “length”. An obvious, if often technically involved, 
solution is to consider simultaneously all probabilities which are compatible with 
elicited comparisons. This and other forms of “robust Bayesian” approaches will 
be reviewed in Section 5.6.3. In this work, we shall proceed on the basis of a 
prescriptive theory which assumes precise quantification, but then pragmatically 
acknowledges that, in practice, all this should be taken with a large pinch of salt and 
a great deal of systematic sensitivity analysis. For a related practical discussion, 
see Hacking (1965). See also Chateauneuf and Jaffray (1984). 
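
As a minimal sketch of what such a sensitivity analysis might look like, suppose that the probability p of a single event E is only known to lie in an interval [lower, upper]. The actions and utilities below are hypothetical, and the grid search is a crude stand-in for the more careful methods referred to above. Expected utility is linear in p, so its range over the interval is attained at the endpoints; the set of actions that are optimal for at least one admissible p corresponds to the class of “would-be” actions.

```python
def expected_utility(p, u_if_e, u_if_not_e):
    """Expected utility of an action when the event E has probability p."""
    return p * u_if_e + (1 - p) * u_if_not_e


def utility_range(lower, upper, u_if_e, u_if_not_e):
    """Range of expected utility as p varies over [lower, upper].

    Expected utility is linear in p, so the extremes occur at the endpoints.
    """
    ends = (expected_utility(lower, u_if_e, u_if_not_e),
            expected_utility(upper, u_if_e, u_if_not_e))
    return min(ends), max(ends)


def candidate_actions(lower, upper, actions, grid=1001):
    """Actions maximising expected utility for at least one p on a grid.

    The grid is an approximation: an action that is optimal only on a very
    narrow sub-interval could in principle be missed.
    """
    best = set()
    for i in range(grid):
        p = lower + (upper - lower) * i / (grid - 1)
        scores = {name: expected_utility(p, *u) for name, u in actions.items()}
        top = max(scores.values())
        best.update(name for name, s in scores.items() if s == top)
    return best


# Hypothetical utilities: (utility if E occurs, utility if E does not occur).
ACTIONS = {"a1": (10.0, 0.0), "a2": (4.0, 4.0), "a3": (0.0, 3.0)}

if __name__ == "__main__":
    print(candidate_actions(0.3, 0.5, ACTIONS))    # imprecise p: two candidates
    print(candidate_actions(0.45, 0.5, ACTIONS))   # tighter p: one candidate
```

With the wider interval the analysis returns two candidate actions and gives no basis for choosing between them; tightening the interval resolves the choice, which is the practical point of eliciting further comparisons.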


Subjectivity of Probability 

As we stressed in Section 2.2, the notion of preference between options, the primi- 
tive operational concept which underlies all our other definitions, is to be understood 
as personal, in the sense that it derives from the response of a particular individual 
to a decision making situation under uncertainty. A particular consequence of this 
is that the concept which emerges is personal degree of belief, defined in Section 2.4 
and subsequently shown to combine for compound events in conformity with the 
properties of a finitely additive probability measure. 

The “individual” referred to above could, of course, be some kind of group, 
such as a committee, provided the latter had agreed to “speak with a single voice”, 
in which case, to the extent that we ignore the processes by which the group arrives 
at preferences, it can conveniently be regarded as a “person”. Further comments 
on the problem of individuals versus groups will be given later under the heading 
Communication and Group Decision Making. 

This idea that personal (or subjective) probability should be the key to the 
“scientific” or “rational” treatment of uncertainty has proved decidedly unpalatable 
to many statisticians and philosophers (although in some application areas, such as 
actuarial science, it has met with a more favourable reception: see Clarke, 1954). At 
the very least, it appears to offend directly against the general notion that the meth- 
ods of science should, above all else, have an “objective” character. Nevertheless, 
bitter though the subjectivist pill may be, and admittedly difficult to swallow, the 
alternatives are either inert, or have unpleasant and unexpected side-effects or, to 
the extent that they appear successful, are found to contain subjectivist ingredients. 

From the objectivistic standpoint, there have emerged two alternative kinds 
of approach to the definition of “probability”, both seeking to avoid the subjective 
degree of belief interpretation. The first of these retains the idea of probability as 
measurement of partial belief, but rejects the subjectivist interpretation of the latter, 
regarding it, instead, as a unique degree of partial logical implication between one 
statement and another. The second approach, by far the most widely accepted 
in some form or another, asserts that the notion of probability should be related 
in a fundamental way to certain “objective” aspects of physical reality, such as 
symmetries or frequencies. 

The logical view was given its first explicit formulation by Keynes (1921/1929) 
and was later championed by Carnap (1950/1962) and others: it is interesting to note, 
however, that Keynes seems subsequently to have changed his view and acknowl- 
edged the primacy of the subjectivist interpretation (see Good, 1965, Chapter 2). 
Brown (1993) proposes the related concept of “impersonal” probability. 

From an historical point of view, the first systematic foundation of the frequen- 
tist approach is usually attributed to Venn (1886), with later influential contributions 
from von Mises (1928) and Reichenbach (1935). The case for the subjectivist ap- 
proach and against the objectivist alternatives can be summarised as follows. 

The logical view is entirely lacking in operational content. Unique probability 
values are simply assumed to exist as a measure of the degree of implication between 
one statement and another, to be intuited, in some undefined way, from the formal 
structure of the language in which these statements are presented. 

The symmetry (or classical) view asserts that physical considerations of sym- 
metry lead directly to a primitive notion of “equally likely cases”. But any uncertain 
situation typically possesses many plausible “symmetries”: a truly “objective” the- 
ory would therefore require a procedure for choosing a particular symmetry and for 
justifying that choice. The subjectivist view explicitly recognises that regarding 
a specific symmetry as probabilistically significant is itself, inescapably, an act of 
personal judgement. 

The frequency view can only attempt to assign a measure of uncertainty to 
an individual event by embedding it in an infinite class of “similar” events having 
certain “randomness” properties, a “collective” in von Mises’ (1928) terminology, 
and then identifying “probability” with some notion of limiting relative frequency. 
But an individual event can be embedded in many different “collectives” with no 
guarantee of the same resulting limiting relative frequencies: a truly “objective” 
theory would therefore require a procedure for justifying the choice of a particular 
embedding sequence. Moreover, there are obvious difficulties in defining the un- 
derlying notions of “similar” and “randomness” without lapsing into some kind of 
circularity. The subjectivist view explicitly recognises that any assertion of “simi- 
larity” among different, individual events is itself, inescapably, an act of personal 
judgement, requiring, in addition, an operational definition of what is meant by 
“similar”. 

In fact, this latter requirement finds natural expression in the concept of an 
exchangeable sequence of events, which we shall discuss at length in Chapter 4. 
This concept, via the celebrated de Finetti representation theorem, provides an 
elegant and illuminating explanation, from an entirely subjectivistic perspective, of 
the fundamental role of symmetries and frequencies in the structuring and evaluation 
of personal beliefs. It also provides a meaningful operational interpretation of the 
word “objective” in terms of “intersubjective consensus”. 

The identification of probability with frequency or symmetry seems to us to 
be profoundly misguided. It is of paramount importance to maintain the distinction 
between the definition of a general concept and the evaluation of a particular 
case. In the subjectivist approach, the definition derives from logical notions of 
quantitative coherent preferences: practical evaluations in particular instances often 
derive from perceived symmetries and observed frequencies, and it is only in this 
evaluatory process that the latter have a role to play. 

The subjectivist point of view outlined above is, of course, not new and has been 
expounded at considerable length and over many years by a number of authors. The 
idea of probability as individual “degree of confidence” in an event whose outcome 
is uncertain seems to have been first put forward by James Bernoulli (1713/1899). 
However, it was not until Thomas Bayes’ (1763) famous essay that it was explicitly 
used as a definition: 


The probability of any event is the ratio between the value at which an expectation 
depending on the happening of the event ought to be computed, and the value of 
the thing expected upon its happening. 


Not only is this directly expressed in terms of operational comparisons of 
certain kinds of simple options on the basis of expected values, but the style of 
Bayes’ presentation strongly suggests that these expectations were to be interpreted 
as personal evaluations. 

A number of later contributions to the field of subjective probability are 
collected together and discussed in the volume edited by Kyburg and Smokler 
(1964/1980), which includes important seminal papers by Ramsey (1926) and 
de Finetti (1937/1964). An exhaustive and profound discussion of all aspects 
of subjective probability is given in de Finetti's magisterial Theory of Probabil- 
ity (1970/1974, 1970/1975). Other interpretations of probability are discussed in 
Renyi (1955), Good (1959), Kyburg (1961, 1974), Fishburn (1964), Fine (1973), 
Hacking (1975), de Finetti (1978), Walley and Fine (1979) and Shafer (1990). 


Statistical Inference as a Decision Problem 


Stylised statistical problems have often been approached from a decision-theoretical 
viewpoint; see, for instance, the books by Ferguson (1967), DeGroot (1970), Bar- 
nett (1973/1982), Berger (1985a) and references therein. However, we have already 
made clear that, in our view, the supposed dichotomy between inference and deci- 
sion is illusory, since any report or communication of beliefs following the receipt 
of information inevitably itself constitutes a form of action. In Section 2.7, we 
formalised this argument and characterised the utility structure that is typically ap- 
propriate for consequences in the special case of a “pure inference” problem. The 
expected utility of an “experiment” in this context was then seen to be identified 
with expected information (in the Shannon sense), and a number of information- 
theoretic ideas and their applications were given a unified interpretation within a 
purely subjectivist Bayesian framework. 
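
For a finite parameter set and a finite set of outcomes, the expected information provided by an experiment can be computed directly. The sketch below assumes the usual Shannon-type form, namely the expected logarithmic divergence of the posterior from the prior, averaged over the predictive distribution of the data; that this coincides with the definition of Section 2.7 is an assumption here, and the priors and likelihoods are purely illustrative.

```python
import math


def expected_information(prior, likelihood, outcomes):
    """Expected information (in nats) about theta provided by one observation x.

    prior:      dict theta -> p(theta)
    likelihood: dict theta -> dict x -> p(x | theta)
    Computes sum_x p(x) * sum_theta p(theta | x) * log(p(theta | x) / p(theta)).
    """
    total = 0.0
    for x in outcomes:
        joint = {t: prior[t] * likelihood[t][x] for t in prior}
        px = sum(joint.values())  # predictive probability of x
        if px == 0:
            continue  # outcomes that cannot occur contribute nothing
        for t, j in joint.items():
            if j > 0:
                post = j / px  # Bayes' theorem
                total += px * post * math.log(post / prior[t])
    return total


# Hypothetical two-hypothesis examples with a uniform prior.
PRIOR = {"t1": 0.5, "t2": 0.5}
USELESS = {"t1": {0: 0.5, 1: 0.5}, "t2": {0: 0.5, 1: 0.5}}
PERFECT = {"t1": {0: 1.0, 1: 0.0}, "t2": {0: 0.0, 1: 1.0}}
NOISY = {"t1": {0: 0.8, 1: 0.2}, "t2": {0: 0.3, 1: 0.7}}

if __name__ == "__main__":
    for name, lik in (("useless", USELESS), ("noisy", NOISY), ("perfect", PERFECT)):
        print(name, expected_information(PRIOR, lik, [0, 1]))
```

An experiment whose outcome does not depend on the parameter has zero expected information, while one that identifies the parameter exactly attains the prior entropy, here log 2; the noisy case falls strictly between the two.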

Many approaches to statistical inference do not, of course, assign a primary role 
to reporting probability distributions, and concentrate instead on stylised estimation 
and hypothesis testing formulations of the problem (see Appendix B, Section 3). 
We shall deal with these topics in more detail in Chapters 5 and 6. 


Communication and Group Decision Making 

The Bayesian approach which has been presented in this chapter is predicated on the 
primitive notion of individual preference. A seemingly powerful argument against 
the use of the Bayesian paradigm is therefore that it provides an inappropriate 
basis for the kinds of interpersonal communication and reporting processes which 
characterise both public debate about beliefs regarding scientific and social issues, 
and also “cohesive-small-group” decision making processes. We believe that the 
two contexts, “public” and “cohesive-small-group”, pose rather different problems, 
requiring separate discussion. 

In the case of the revision and communication of beliefs in the context of 
general scientific and social debate, we feel that criticism of the Bayesian paradigm 
is largely based on a misunderstanding of the issues involved, and on an over- 
simplified view of the paradigm itself, and the uses to which it can be put. So far as 
the issues are concerned, we need to distinguish two rather different activities: on 
the one hand, the prescriptive processes by which we ought individually to revise 
our beliefs in the light of new information if we aspire to coherence; on the other 
hand, the pragmatic processes by which we seek to report to and share perceptions 
with others. The first of these processes leads us inescapably to the conclusion that 
beliefs should be handled using the Bayesian paradigm; the second reminds us that 
a “one-off” application of the paradigm to summarise a single individual's revision 
of beliefs is inappropriate in this context. 

But, so far as we are aware, no Bayesian statistician has ever argued that the 
latter would be appropriate. Indeed, the whole basis of the subjectivist philosophy 
predisposes Bayesians to seek to report a rich range of the possible belief mappings 
induced by a data set, the range being chosen both to reflect (and even to challenge) 
the initial beliefs of a range of interested parties. Some discussion of the Bayesian 
reporting process may be found in Dickey (1973), Dickey and Freeman (1975) 
and Smith (1978). Further discussion is given in Smith (1984), together with a 
review of the connections between this issue and the role of models in facilitating 
communication and consensus. This latter topic will be further considered in 
Chapter 4. 

We concede that much remains to be done in developing Bayesian reporting 
technology, and we conjecture that modern interactive computing and graphics 
will have a major role to play. Some of the literature on expert systems is relevant 
here; see, for instance, Lindley (1987), Spiegelhalter (1987) and Gaul and Schader 
(1988). On the broader issue, however, one of the most attractive features of the 
Bayesian approach is its recognition of the legitimacy of the plurality of (coherently 
constrained) responses to data. Any approach to scientific inference which seeks to 
legitimise an answer in response to complex uncertainty seems to us a totalitarian 
parody of a would-be rational human learning process. 

On the other hand, in the “cohesive-small-group” context there may be an 
imposed need for group belief and decision. A variety of problems can be isolated 
within this framework, depending on whether the emphasis is on combining prob- 
abilities, or utilities, or both; and on how the group is structured in relation to such 
issues as “democracy”, “information-sharing”, “negotiation” or “competition”. It 
is not yet clear to us whether the analyses of these issues will impinge directly on 
the broader controversies regarding scientific inference methodology, and so we 
shall not attempt a detailed review of the considerable literature that is emerging. 

Useful introductions to the extensive literature on amalgamation of beliefs or 
utilities, together with most of the key references, are provided by Arrow (1951b), 
Edwards (1954), Luce and Raiffa (1957), Luce (1959), Stone (1961), Blackwell 
and Dubins (1962), Fishburn (1964, 1968, 1970, 1987), Kogan and Wallace (1964), 
Wilson (1968), Winkler (1968, 1981), Sen (1970), Kranz et al. (1971), Marschak 
and Radner (1972), Cochrane and Zeleny (1973), DeGroot (1974, 1980), Mor- 
ris (1974), White and Bowen (1975), White (1976a, 1976b), Press (1978, 1980b, 
1985b), Lindley et al. (1979), Roberts (1979), Hogarth (1980), Saaty (1980), Berger 
(1981), French (1981, 1985, 1986, 1989), Hylland and Zeckhauser (1981), Weer- 
ahandi and Zidek (1981, 1983), Brown and Lindley (1982, 1986), Chankong and 
Haimes (1982), Edwards and Newman (1982), DeGroot and Feinberg (1982, 1983, 
1986), Raiffa (1982), French et al. (1983), Lindley (1983, 1985, 1986), Bunn 
(1984), Caro et al. (1984), Genest (1984a, 1984b), Yu (1985), De Waal et al. (1986), 
Genest and Zidek (1986), Arrow and Raynaud (1987), Clemen and Winkler (1987, 
1993), Kim and Roush (1987), Barlow et al. (1988), Bayarri and DeGroot (1988, 
1989, 1991), Huseby (1988), West (1988, 1992a), Clemen (1989, 1990), Rios et al. 
(1989), Seidenfeld et al. (1989), Rios (1990), DeGroot and Mortera (1991), Kelly 
(1991), Lindley and Singpurwalla (1991, 1993), Goel et al. (1992), Goicoechea 
et al. (1992), Normand and Tritchler (1992) and Gilardoni and Clayton (1993). 
Important, seminal papers are reproduced in Gardenfors and Sahlin (1968). For 
related discussion in the context of policy analysis, see Hodges (1987). 

References relating to the Bayesian approach to game theory include Harsanyi 
(1967), DeGroot and Kadane (1980), Eliashberg and Winkler (1981), Kadane 
and Larkey (1982, 1983), Raiffa (1982), Wilson (1986), Aumann (1987), Smith 
(1988b), Nau and McCardle (1990), Young and Smith (1991), Kadane and Seiden- 
feld (1992) and Keeney (1992). 

A recent review of related topics, followed by an informative discussion, is 
provided by Kadane (1993). 


Chapter 3 


Generalisations 


Summary 


The ideas and results of Chapter 2 are extended to a much more general mathemat- 
ical setting. An additional postulate concerning the comparison of a countable 
collection of events is introduced and is shown to provide a justification for re- 
stricting attention to countably additive probability as the basis for representing 
beliefs. The elements of mathematical probability theory are reviewed. The no- 
tions of options and utilities are extended to provide a very general mathematical 
framework for decision theory. A further additional postulate regarding prefer- 
ences is introduced, and is shown to justify the criterion of maximising expected 
utility within this more general framework. In the context of inference problems, 
generalised definitions of score functions and of measures of information and 
discrepancy are given. 


3.1 GENERALISED REPRESENTATION OF BELIEFS 
3.1.1 Motivation 


The developments of Chapter 2, based on Axioms 1 to 5, led to the fundamental 
result that quantitatively coherent degrees of belief for events belonging to the 
algebra $\mathcal{E}$ should fit together in conformity with the mathematical rules of finitely 
additive probability. From a directly practical and intuitive point of view, there 
seems no compelling reason to require anything beyond this finitistic framework, a 
view argued forcefully, and in great detail, by de Finetti (1970/1974, 1970/1975). 

However, there are many situations where the implied necessity of choosing 
a particular finitistic representation of a problem can lead to annoying conceptual 
and mathematical complications, as we remarked in Section 2.5.1 when discussing 
bounded sets of consequences. The example given in that context involved the 
problem of representing the length of remaining life of a medical patient. Most 
people would accept that there is an implicit upper bound, but find it difficult 
to justify any particular choice of its value. Similar problems obviously arise in 
representing other forms of survival time (of equipment, transplanted organs, or 
whatever) and further difficulties occur in representing the possible outcomes of 
many other measurement processes, since these are generally regarded as being 
on a continuous scale. For these reasons, and provided we do not feel that in so 
doing we are distorting essential features of our beliefs, it is certainly attractive, 
from the point of view of descriptive and mathematical convenience, to consider 
the possibility of extending our ideas beyond the finite, discrete framework. 

In Section 3.1.2, we shall provide a formal extension of the quantitative coher- 
ence theory to the infinite domain. Our fundamental conclusion about beliefs in this 
setting will be that quantitatively coherent degrees of belief for events belonging 
to a $\sigma$-algebra $\mathcal{E}$ should fit together in conformity with the mathematical rules of 
countably additive probability. 

The major mathematical advantage of this generalised framework is that all 
the standard manipulative tools and results of mathematical probability theory then 
become available to us; convenient references are, for example, Kingman and Taylor 
(1966) or Ash (1972). A selection of these tools and results will be reviewed in 
Section 3.2, and then used in Section 3.3 to develop natural extensions of our 
finitistic definitions of actions and utilities, thus establishing an extremely general 
mathematical setting for the representation and analysis of decision problems. The 
important special case of inference as a decision problem will be considered in 
Section 3.4, which extends to the general mathematical framework the discussion 
of the finite, discrete case, given in Section 2.7. Finally, a discussion of some 
particular issues is given in Section 3.5. 


3.1.2 Countable Additivity 


In Definition 2.1 and the subsequent discussion, we assumed that the collection $\mathcal{E}$ 
of events included in the underlying frame of discourse should be closed under the 
operations of arbitrary finite intersections and unions. As the first step in providing 
a mathematical extension to the infinite domain, we shall now assume that we 
allow arbitrary countable intersections and unions in $\mathcal{E}$, so that the latter is taken 
to be a $\sigma$-algebra. Within this extended structure, Axioms 1 to 5 will continue to 
encapsulate the requirements of quantitative coherence for preferences, and hence 
for degrees of belief, provided that only finite combinations of events of $\mathcal{E}$ are 
involved. However, if we wish to deal with countable combinations of events, we 
shall need an extension of the existing requirements for quantitative coherence. 
One possible such extension is encapsulated in the following postulate. 


Postulate 1. (Monotone continuity). 
If, for all $j$, $E_j \supseteq E_{j+1}$ and $E_j \geq F$, then $\bigcap_{j=1}^{\infty} E_j \geq F$. 


Discussion of Postulate 1. If the relation $E_j \geq F$ holds for every member of a 
decreasing sequence of events $E_1 \supseteq E_2 \supseteq \cdots$, and if we accept the limit event $\bigcap_j E_j$ 
into our frame of discourse, then it would seem very natural in terms of “continuity” 
that the relation should “carry over”. The operational justification for considering 
such a countable sequence of comparisons is certainly open to doubt. However, if, 
for descriptive and mathematical convenience, we admit the possibility, then this 
form of continuity would seem to be a minimal requirement for coherence. 


Proposition 3.1. (Continuity at $\emptyset$). If, for all $j$, $E_j \supseteq E_{j+1}$ 
and $\bigcap_{j=1}^{\infty} E_j = \emptyset$, then, for any $G > \emptyset$, $\lim_{j \to \infty} P(E_j \mid G) = 0$. 


Proof. We note first that the condition encapsulated in Postulate 1 carries 
over to conditional preferences. By Proposition 2.14(i), if $E_j \geq_G F$ then we have 
$E_j \cap G \geq F \cap G$; moreover, $E_j \cap G \supseteq E_{j+1} \cap G$, for all $j = 1, 2, \ldots$, and thus, 
by Postulate 1, $(\bigcap_j E_j) \cap G \geq F \cap G$. It now follows from Proposition 2.14(i) 
that $(\bigcap_j E_j) \geq_G F$. By Proposition 2.16, $P(E_j \mid G) \geq P(E_{j+1} \mid G) \geq 0$, and 
so there exists a number $p \geq 0$ such that $\lim_{j \to \infty} P(E_j \mid G) = p$, and, for all $j$, 
$P(E_j \mid G) \geq p$. By Axiom 4(iii) and Proposition 2.16, there exists a standard event 
$S$ such that $\mu(S) = p$, $G \perp S$ and, for all $j$, $E_j \geq_G S$. Hence, by the above, we 
have $(\bigcap_j E_j) = \emptyset \geq_G S$, which implies that $S \sim \emptyset$ and thus, by Propositions 2.10 
and 2.11, that $p = 0$. ◁ 


Since we have already established in Proposition 2.17 that P(. | G) is a finitely 
additive probability measure, the above result, based on the postulate of mono- 
tone continuity, enables us to establish immediately that, in this extended setting, 
P(.|G) is a countably additive probability measure. 


Proposition 3.2. (Countably additive structure of degrees of belief). 
If $\{E_j,\ j = 1, 2, \ldots\}$ are disjoint events in $\mathcal{E}$, and $G > \emptyset$, then 

\[
P\Bigl(\bigcup_{j=1}^{\infty} E_j \Bigm| G\Bigr) = \sum_{j=1}^{\infty} P(E_j \mid G).
\]

Proof. Since $P(\cdot \mid G)$ is finitely additive we have, for any $n \geq 1$, 

\[
P\Bigl(\bigcup_{j=1}^{\infty} E_j \Bigm| G\Bigr) = \sum_{j=1}^{n} P(E_j \mid G) + P(F_n \mid G),
\]

where 

\[
F_n = \bigcup_{j>n} E_j .
\]

It follows that $F_n \supseteq F_{n+1}$ with $\bigcap_n F_n = \emptyset$; hence, by Proposition 3.1, 

\[
\lim_{n \to \infty} P(F_n \mid G) = 0 .
\]

The result follows by taking limits in the last expression for $P(\bigcup_j E_j \mid G)$. ◁ 
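
The decomposition used in this proof can be illustrated numerically. The sketch below takes G to be the certain event and assumes, purely for illustration, disjoint events E_j with P(E_j) = 2^(-j) which together exhaust the certain event; the closed form used for the tail P(F_n) then follows from finite additivity alone. The function names are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def p_event(j):
    """P(E_j) for the assumed disjoint events E_j, j = 1, 2, ...: 2**-j."""
    return HALF ** j


def p_tail(n):
    """P(F_n), where F_n is the union of the E_j with j > n.

    Since the E_j are assumed to exhaust the certain event, finite additivity
    gives P(F_n) = 1 - sum_{j<=n} 2**-j = 2**-n.
    """
    return HALF ** n


def decomposition(n):
    """Return (sum of P(E_j) for j <= n, P(F_n))."""
    partial = sum((p_event(j) for j in range(1, n + 1)), Fraction(0))
    return partial, p_tail(n)


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        partial, tail = decomposition(n)
        print(n, float(partial), float(tail), partial + tail)
```

For every n the partial sum and the tail add exactly to one, and the tail decreases to zero, so the partial sums converge to the probability of the countable union, as the proposition asserts.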


We shall consider the finite versus countable additivity debate in a little more 
detail in Section 3.5.2. For the present, we simply note that philosophical allegiance 
to the finitistic framework is in no way incompatible with the systematic adoption 
and use of countably additive probability measures for the overwhelming majority 
of applications. The debate centres on whether this particular restriction to a 
subclass of the finitely additive measures should be considered as a necessary 
feature of quantitative coherence, or whether it is a pragmatic option, outside 
the quantitative coherence framework encapsulated in Axioms 1 to 5. From a 
philosophical point of view, we identify strongly with this latter viewpoint; but 
in almost all the developments which follow we shall rarely feel discomfited by 
implicitly working within a countably additive framework. 


We have established (in Propositions 2.4 and 2.10) that if $E$ and $F$ are events 
in $\mathcal{E}$ with $E \subseteq F$, then $P(E) \leq P(F)$, so that if $P(F) = 0$ then $P(E) = 0$. 
However, in general, not all subsets of an event of probability zero (a so-called null 
event) will belong to $\mathcal{E}$ and so we cannot logically even refer to their probabilities, 
let alone infer that they are zero. In some circumstances it may be desirable, as 
well as mathematically convenient, to be able to do this. If so, this can be done 
“automatically” by simply agreeing that $\mathcal{E}$ be replaced by the smallest $\sigma$-algebra, 
$\mathcal{F}$, which contains $\mathcal{E}$ and all the subsets of the null events of $\mathcal{E}$ (the so-called 
completion of $\mathcal{E}$). The induced probability measure over $\mathcal{F}$ is unique and has the 
property that all subsets of null events are themselves null events. It is called a 
complete probability measure. 
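
On a finite set, where a $\sigma$-algebra is simply an algebra, the completion can be constructed explicitly. The toy example below is an assumption-laden sketch rather than a general construction: it starts from the algebra generated by a hypothetical partition {a}, {b}, {c, d} in which {c, d} is a null event, and forms every set A ∪ N with A an event and N a subset of a null event.

```python
from itertools import chain, combinations


def powerset(s):
    """All subsets of s, as frozensets."""
    items = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]


def algebra_from_partition(atom_prob):
    """Algebra generated by a finite partition; returns {event: probability}."""
    prob = {}
    atoms = list(atom_prob)
    for r in range(len(atoms) + 1):
        for combo in combinations(atoms, r):
            event = frozenset().union(*combo)
            prob[event] = sum(atom_prob[a] for a in combo)
    return prob


def completion(prob):
    """Complete a finite algebra with respect to its null events.

    Every set of the form A | N, with A an event and N a subset of a null
    event, is added and given the probability of A.
    """
    # In the finite case the union of all null events is itself a null event.
    null_union = frozenset().union(*(e for e in prob if prob[e] == 0))
    completed = {}
    for a, p in prob.items():
        for n in powerset(null_union):
            completed[a | n] = p  # adding a null set leaves the probability unchanged
    return completed


# Hypothetical example: the atoms {a}, {b}, {c, d}, with {c, d} null.
ATOMS = {frozenset("a"): 0.5, frozenset("b"): 0.5, frozenset("cd"): 0.0}
PROB = algebra_from_partition(ATOMS)
COMPLETED = completion(PROB)

if __name__ == "__main__":
    print(len(PROB), len(COMPLETED))
    print(frozenset("c") in PROB, COMPLETED[frozenset("c")])
```

Before completion, {c} is not an event and its probability cannot even be referred to; afterwards it is an event of probability zero, and the completed algebra here consists of all sixteen subsets of the four-point set.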


Definition 3.1. (Probability space). A probability space is defined by the 
elements $\{\Omega, \mathcal{F}, P\}$ where $\mathcal{F}$ is a $\sigma$-algebra of $\Omega$ and $P$ is a complete, 
$\sigma$-additive probability measure on $\mathcal{F}$. 


From now on, our mathematical development will take place within the as- 
sumed structure of a probability space. We do not anticipate encountering situations 
where these mathematical assumptions lead to conceptual distortions beyond the 
usual, inevitable element of mathematical idealisation which enters any formal anal- 
ysis. However, as de Finetti (1970/1974, 1970/1975) has so eloquently warned, 
one must always be on guard and aware that distortions might occur. 

The material which follows (in Section 3.2) will differ in flavour somewhat 
from our preceding discussion of the general foundations of coherent beliefs and 
actions (Chapter 2 and Section 3.1) and our subsequent discussions of generalised 
decision problems (Section 3.3) and of the link between beliefs about observables 
and the structure of specific models for representing such beliefs (Chapter 4). These 
developments systematically invoke the subjectivist, operationalist philosophy as 
a basic motivation and guiding principle. In the next section, we shall concen- 
trate instead on reviewing, from a purely mathematical standpoint, the concepts 
and results from mathematical probability theory which will provide the technical 
underpinning of our theory. 


3.2 REVIEW OF PROBABILITY THEORY 
3.2.1. Random Quantities and Distributions 


In the framework we have been discussing, the constituent possibilities and proba- 
bilities of any decision problem are encapsulated in the structure of the probability 
space $\{\Omega, \mathcal{F}, P\}$. Now, in a certain abstract sense, we might think of $\Omega$ as the
“primitive” collection of all possible outcomes in a situation of interest; for exam- 
ple, that surrounding the birth of an infant, or the state of international commodity 
markets at a particular time point. However, we are not really interested in a 
“complete description”, even if such were possible, but rather in some numerical 
summary of the outcomes, in the forms of counts or measurements. 


Recalling the discussion of Section 2.8, it might be argued that “measurements” 
are always, in fact, “counts”. However, when convenient, we shall distinguish 
the two in the usual pragmatic (fuzzy) way: “counts” will typically mean integer- 
valued data; “measurements” will typically mean data which we pretend are 
real-valued. 


We move, therefore, from $\{\Omega, \mathcal{F}, P\}$ to a more explicitly numerical setting by
invoking a mapping,

$$x : \Omega \to X \subseteq \Re,$$

which associates a real number $x(\omega)$ with each elementary outcome $\omega$ of $\Omega$ (our
initial exposition will be in terms of a single-valued $x$; the vector extension will be
made in Section 3.2.4). Subsets of $\Omega$ are thus mapped into subsets of $\Re$ and the
probability measure $P$ defined on $\mathcal{F}$ will induce a probability measure, $P_x$, say,
over appropriate subsets of $\Re$. However, we shall wish to ensure that $P_x$ is defined
on certain special subsets of $\Re$, for example, intervals such as $(-\infty, a]$, $a \in \Re$, in
which case we shall want sets of the form $\{\omega : -\infty < x(\omega) \leq a\}$, $a \in \Re$, to
belong to $\mathcal{F}$, and this will constrain the class of functions $x$ which we would wish
to use to define the numerical mapping. The standard requirement is that $P_x$ be
definable on the $\sigma$-algebra of Borel sets, $\mathcal{B}$, of $\Re$, the smallest $\sigma$-algebra containing
intervals of the form $(-\infty, a]$, $a \in \Re$, and hence all forms of interval, since the
latter can be generated by appropriate countable unions and intersections of the
intervals $(-\infty, a]$, $a \in \Re$.


Definition 3.2. (Random quantity). A random quantity on a probability
space $\{\Omega, \mathcal{F}, P\}$ is a function $x : \Omega \to X \subseteq \Re$ such that $x^{-1}(B) \in \mathcal{F}$, for
all $B \in \mathcal{B}$.


Following de Finetti (1970/1974, 1970/1975), we use the term random quantity
to signify a numerical entity whose value is uncertain, rather than use the
traditional, but potentially confusing, term random variable, which might suggest
a restriction to contexts involving repeated “trials” over which the quantity may
vary. Notationally, we shall use the same symbol for both a random quantity and
its value. Thus, for example, $x$ may denote a function, or a particular value of the
function, $x(\omega) = x$, say. The interpretation will always be clear from the context.

For a random quantity $x$, the induced measure $P_x$ is defined in the natural way
by

$$P_x(B) = P(x^{-1}(B)), \quad B \in \mathcal{B}.$$

The function $P_x$ is easily seen to be a probability measure, and describes the way in
which probability is “distributed” over the possible values $x \in X$. This information
can also be encapsulated in a single real-valued function.


Definition 3.3. (Distribution function). The distribution function of a random
quantity $x : \Omega \to X \subseteq \Re$ on $\{\Omega, \mathcal{F}, P\}$ is the function $F_x : \Re \to [0, 1]$
defined by

$$F_x(x) = P\{\omega : x(\omega) \leq x\} = P_x\{(-\infty, x]\}, \quad x \in \Re.$$


If the probability distribution concentrates on a countable set of values, so
that $X = \{x_1, x_2, \ldots\}$, $x$ is called a discrete random quantity and the function
$p_x : \Re \to [0, 1]$ such that

$$p_x(x) = P\{\omega : x(\omega) = x\}$$

is called its probability (mass) function. The distribution function is then a step
function with jumps $p_x(x_i)$ at each $x_i$.

If the probability distribution is such that there exists a real, non-negative
(measurable) function $p_x$ such that

$$P(x \in B) = \int_B p_x(t)\,dt = \int_B dF_x(t), \quad B \in \mathcal{B},$$

then $x$ is called an (absolutely) continuous random quantity and $p_x$ is called its
density function. In addition, of course, we might have a mixture of both discrete and 
continuous elements. No use of singular distributions will be made in this volume 
(for discussion of such distributions, see, for example, Ash, 1972, Section 2.2). 


We shall use the same notation, $p_x$, for both the mass function of a discrete
random quantity and the density function of a continuous random quantity. In 
measure-theoretic terms, both are, of course, special cases of the Radon-Nikodym 
derivative. In general, we shall use the notation and results of Lebesgue and 
Lebesgue-Stieltjes integration theory as and when it suits us. Readers unfamiliar 
with these concepts need not worry: virtually none of the machinery will be 
visible, and the meanings of integrals will rarely depend on the niceties of the 
interpretation adopted. Moreover, when there is no danger of confusion, we
shall often omit the suffix $x$ in $p_x(x)$, using $p(x)$ both to represent the density
or mass function $p_x(\cdot)$ and its value $p_x(x)$ at a particular $x \in X$. Also, to
avoid tedious repetition of phrases like “almost everywhere”, we shall, when
appropriate, simply state that densities are equal, leaving it to be understood that, 
with respect to the relevant measure, this means “equal, except possibly on a set 
of measure zero”. 


If $x$ is a random quantity defined on $\{\Omega, \mathcal{F}, P\}$ such that $x : \Omega \to X \subseteq \Re$,
and if $g : \Re \to Y \subseteq \Re$ is a function such that $(g \circ x)^{-1}(B) \in \mathcal{F}$ for all $B \in \mathcal{B}$,
then $g \circ x$ is also a random quantity. We shall typically denote $g \circ x$ by $g(x)$, and,
whenever we refer to such functions of a random quantity $x$, it is to be understood
that the composite function is indeed a random quantity. Writing $y = g(x)$, the
random quantity $y$ induces a probability space $\{\Re, \mathcal{B}, P_y\}$, where

$$P_y(B) = P_x(g^{-1}(B)) = P((g \circ x)^{-1}(B)), \quad B \in \mathcal{B}.$$

Functions such as $F_y$ and $p_y$ are defined in the obvious way. These forms are easily
related to those of $F_x$ and $p_x$. In particular, if $g^{-1}$ exists and is strictly monotonic
increasing we have

$$F_y(y) = P_x(g(x) \leq y) = P_x(x \leq g^{-1}(y)) = F_x(g^{-1}(y))$$

and, in the continuous case, if $g$ is monotonic and differentiable, the density $p_y$ is
given by

$$p_y(y) = \frac{\partial F_y(y)}{\partial y} = \frac{\partial F_x(g^{-1}(y))}{\partial y} = p_x(g^{-1}(y))\,\frac{\partial g^{-1}(y)}{\partial y}$$

(with the final derivative replaced by its absolute value when $g$ is decreasing).


Some examples of this relationship are given at the end of Section 3.2.2.

Definition 3.4. (Expectation). If $x$, $y$ are random quantities with $y = g(x)$,
the expectation of $y$, $E[y] = E[g(x)]$, is defined by either

$$\sum_{y \in Y} y\,p_y(y) = \sum_{x \in X} g(x)\,p_x(x), \quad \text{or}$$

$$\int_Y y\,p_y(y)\,dy = \int_X g(x)\,p_x(x)\,dx,$$

for the discrete and continuous cases, respectively, where the equality is to be
interpreted in the sense that if either side exists so does the other and they are
equal.
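
The equality in Definition 3.4 can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes, purely for illustration, that $x$ has a unit exponential density and that $g(x) = \sqrt{x}$ (an increasing transformation), and it uses a simple midpoint rule on a truncated range. The helper name `integrate` and the truncation points are assumptions made for this example.

```python
import math

def integrate(f, a, b, n=20000):
    """Composite midpoint rule on [a, b]; never evaluates f at the endpoints."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Assumed illustration: p_x(x) = exp(-x) on x > 0 and y = g(x) = sqrt(x).
p_x = lambda x: math.exp(-x)
g = lambda x: math.sqrt(x)
g_inv = lambda y: y * y
dg_inv = lambda y: 2.0 * y            # derivative of g^{-1}(y)

# Density of y from the change-of-variable formula for increasing g.
p_y = lambda y: p_x(g_inv(y)) * dg_inv(y)

# E[y] on the y scale and on the x scale (upper tails truncated; they are negligible).
e_from_y = integrate(lambda y: y * p_y(y), 0.0, 7.0)      # y <= 7 corresponds to x <= 49
e_from_x = integrate(lambda x: g(x) * p_x(x), 0.0, 49.0)

# Closed form for this particular choice: E[sqrt(x)] = Gamma(3/2).
exact = math.gamma(1.5)
```

Both integrals should agree with each other and with the closed form up to the discretisation error of the midpoint rule.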


As in Definition 3.4, most sums or integrals over possible values of random
quantities will involve the complete set of possible values (for example, $X$ and $Y$).
To simplify notation we shall usually omit the range of summation or integration,
assuming it to be understood from the context. To avoid tiresome duplication,
we shall also typically use the integral form to represent both the continuous and
the discrete cases.


It is useful to be able to summarise the main features of a probability distri- 
bution by quantities defined to encapsulate its location, spread, or shape, often in 
terms of special cases of Definition 3.4. Assuming, in each case, the right-hand
side to exist, such summary quantities include:

(i) $E[x]$, the mean of the distribution of the random quantity $x$;
(ii) $E[x^k]$, the $k$th (absolute) moment;
(iii) $V[x] = E[(x - E[x])^2] = E[x^2] - E^2[x]$, the variance;
(iv) $D[x] = V[x]^{1/2}$, the standard deviation;
(v) $M[x]$, a mode of the distribution of $x$, such that
$$p_x(M[x]) = \sup_{x \in X} p_x(x);$$
(vi) $Q_\alpha[x]$, an $\alpha$-quantile of the distribution of $x$, such that
$$F_x(Q_\alpha[x]) = P_x(x \leq Q_\alpha[x]) = \alpha;$$
(vii) $Me[x] = Q_{0.5}[x]$, a median;
(viii) $(Q_{(1-p)/2}[x],\, Q_{(1+p)/2}[x])$, a $p$-interquantile range.


The expectation operator of Definition 3.4 is linear, so that if $x_1$, $x_2$ are two
random quantities and $c_1$, $c_2$ are finite real numbers then

$$E[c_1 x_1 + c_2 x_2] = c_1 E[x_1] + c_2 E[x_2].$$

In the special case of the transformation $g(x) = cx$, for some real constant $c$, we
clearly have

$$E[g(x)] = c\,E[x] = g(E[x]),$$

$$D[g(x)] = (V[g(x)])^{1/2} = (c^2 V[x])^{1/2} = g(D[x]),$$

the last equality holding for $c \geq 0$.

For general transformations $g(x)$, $E[g(x)] \neq g(E[x])$ and the moments of a
transformed random quantity $g(x)$ do not relate in any straightforward manner
to those of $x$. However, for suitably well-behaved $g(x)$ the following result, which
we shall illustrate at the end of Section 3.2.2, often provides useful approximations.


Proposition 3.3. (Approximate mean and variance). If $x$ is a random quantity
with $E[x] = \mu$, $V[x] = \sigma^2$ and $y = g(x)$ then, subject to conditions on
the distribution of $x$ and the smoothness of $g$,

$$E[y] \approx g(\mu) + \tfrac{1}{2}\sigma^2 g''(\mu),$$

$$V[y] \approx \sigma^2\,[g'(\mu)]^2.$$

Outline proof. Expanding $g(x)$ in a Taylor series about $\mu$, we obtain

$$g(x) \approx g(\mu) + (x - \mu)\,g'(\mu) + \tfrac{1}{2}(x - \mu)^2 g''(\mu),$$

where we are assuming regularity conditions sufficient to ensure the adequacy of
this approximation in what follows. Taking expectations immediately yields the
approximate form for $E[y]$; subtracting the latter approximation from both sides,
squaring, taking expectations and ignoring higher order terms, yields the result for
$V[y]$. Clearly, more refined approximations are easily obtained by including higher
order terms.
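
As a rough numerical check of Proposition 3.3, the sketch below compares the approximations with exact moments in a case where the latter are available in closed form. The choice of a uniform distribution on (1, 2) with $g(x) = \log x$, and the function name `approx_moments`, are assumptions made only for this illustration.

```python
import math

def approx_moments(g, dg, d2g, mu, sigma2):
    """Approximate mean and variance of y = g(x) as given in Proposition 3.3."""
    mean = g(mu) + 0.5 * sigma2 * d2g(mu)
    var = sigma2 * dg(mu) ** 2
    return mean, var

# Assumed test case: x uniform on (a, b) = (1, 2), y = log x.
a, b = 1.0, 2.0
mu, sigma2 = (a + b) / 2, (b - a) ** 2 / 12
approx_mean, approx_var = approx_moments(
    math.log, lambda x: 1 / x, lambda x: -1 / x**2, mu, sigma2
)

# Exact moments from the antiderivatives of log x and (log x)^2.
F1 = lambda x: x * math.log(x) - x
F2 = lambda x: x * math.log(x) ** 2 - 2 * x * math.log(x) + 2 * x
exact_mean = (F1(b) - F1(a)) / (b - a)
exact_var = (F2(b) - F2(a)) / (b - a) - exact_mean ** 2
```

In this example the mean approximation is noticeably closer than the variance approximation, which uses only first-order terms.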


In Definition 2.6, we introduced the notion of the independence of two events 
with subsequent generalisations to mutual independence (Definition 2.12) and con- 
ditional independence (Definition 2.13). These notions can be extended to random 
quantities in the following way. 


Definition 3.5. (Mutual independence). The random quantities $x_1, \ldots, x_n$
are mutually independent if, for any $t_i \in \Re$, the events $\{\omega : x_i(\omega) \leq t_i\}$, for
$i = 1, \ldots, n$, are mutually independent.

We note that for independent random quantities $x_1, \ldots, x_n$,

$$E\Big[\prod_{i=1}^n x_i\Big] = \prod_{i=1}^n E[x_i] \quad \text{and} \quad V\Big[\sum_{i=1}^n x_i\Big] = \sum_{i=1}^n V[x_i].$$

Definition 3.6. (Conditional independence). For any random quantity $y$, the
random quantities $x_1, \ldots, x_n$ are conditionally independent given $y$ if, for
any $t_i \in \Re$, $i = 1, \ldots, n$, the events $\{\omega : x_i(\omega) \leq t_i\}$ are conditionally
independent given the event $\{\omega : y(\omega) \leq y\}$, for all $y$.

Conditional independence will play a major role in our later discussion of
modelling in Chapter 4.

Many forms of technical manipulation of probability distributions are greatly 
facilitated by working with some suitable transformation of the original density or 
distribution function. One of the most useful such transforms is the following. 


Definition 3.7. (Characteristic function). The characteristic function of a
random quantity $x$ is the function $\phi_x$, mapping $\Re$ to the complex plane, given
by

$$\phi_x(t) = E[e^{itx}], \quad t \in \Re.$$

Among the most important properties of the characteristic function, we note
the following.

(i) $|\phi_x(t)| \leq 1$ and $\phi_x(0) = 1$.
(ii) $\phi_x$ is a uniformly continuous function of $t$.
(iii) If $x_1, \ldots, x_n$ are independent random quantities, and $s = \sum_{i=1}^n x_i$, then
$\phi_s(t) = \prod_{i=1}^n \phi_{x_i}(t)$.
(iv) Two random quantities have the same distribution if and only if they have the
same characteristic function.
(v) If $E[x^k] < \infty$, then
$$\phi_x(t) = \sum_{j=0}^{k} \frac{(it)^j}{j!}\,E[x^j] + o(t^k).$$

Many similar properties hold for the closely related alternative transforms $E[e^{tx}]$,
the moment generating function, and $E[t^x]$, the probability generating function.
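
Properties (i) and (iii) are easy to verify numerically for discrete distributions. The sketch below is illustrative only: it assumes the standard fact that a sum of $n$ independent Bernoulli quantities is binomial, and the helper `cf_discrete` and the parameter values are choices made for this example.

```python
import cmath
from math import comb

def cf_discrete(pmf, t):
    """phi(t) = E[exp(itx)] for a discrete distribution given as {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

theta, n = 0.3, 5      # assumed illustrative values
bernoulli = {0: 1 - theta, 1: theta}
binomial = {x: comb(n, x) * theta**x * (1 - theta) ** (n - x) for x in range(n + 1)}

for t in (-2.0, 0.0, 0.7, 3.1):
    lhs = cf_discrete(binomial, t)            # cf of the sum s = x_1 + ... + x_n
    rhs = cf_discrete(bernoulli, t) ** n      # product of n identical Bernoulli cfs
    assert abs(lhs - rhs) < 1e-12             # property (iii)
    assert abs(lhs) <= 1 + 1e-12              # property (i)
```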


3.2.2 Some Particular Univariate Distributions 


In this section, we shall review a number of particular univariate distributions which 
are frequently used in applications, and list some of their properties and character- 
istics. We shall assume that the reader is familiar with most of this material, and 
detailed discussion and derivations are therefore not given. The books by Johnson 
and Kotz (1969, 1970) provide a mass of detail on these and other distributions. 

One important initial warning is required! These distributions provide the
building blocks for statistical models and are typically defined in terms of “parameters”.
The role and interpretation of “models” and “parameters” within the general
subjectivist, operationalist framework are extremely important issues, which will
be discussed at length in Chapter 4. For the present, “parameters” should simply be
regarded as “labels” of the various mathematical functions we shall be considering,
although, as we shall see, these “labelling parameters” often relate closely to one
or other of the characteristics of the distribution.


The Binomial Distribution


A discrete random quantity $x$ has a binomial distribution with parameters $\theta$ and $n$
($0 < \theta < 1$, $n = 1, 2, \ldots$) if its probability function $\mathrm{Bi}(x \mid \theta, n)$ is

$$\mathrm{Bi}(x \mid \theta, n) = \binom{n}{x}\,\theta^x (1 - \theta)^{n - x}, \quad x = 0, 1, \ldots, n.$$

The mean and variance are $E[x] = n\theta$ and $V[x] = n\theta(1 - \theta)$. A mode is attained
at the greatest integer $M[x]$ which does not exceed $x_m = (n + 1)\theta$; if $x_m$ is an
integer, then both $x_m$ and $x_m - 1$ are modes.

If $n = 1$, $x$ is said to have a Bernoulli distribution, with probability function
denoted by $\mathrm{Br}(x \mid \theta)$. The sum of $k$ independent binomial random quantities with
parameters $(\theta, n_i)$, $i = 1, \ldots, k$, is a binomial random quantity with parameters $\theta$
and $n_1 + \cdots + n_k$.
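
The stated mean, variance and mode can be confirmed by direct enumeration. This is a small sketch with assumed parameter values ($\theta = 0.3$, $n = 10$); the function name `bi` simply mirrors the notation above.

```python
from math import comb, floor

def bi(x, theta, n):
    """Binomial probability function Bi(x | theta, n)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

theta, n = 0.3, 10     # assumed illustrative values
probs = [bi(x, theta, n) for x in range(n + 1)]

mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
mode = max(range(n + 1), key=lambda x: probs[x])

# Here (n + 1) * theta = 3.3 is not an integer, so the mode is unique.
stated_mode = floor((n + 1) * theta)
```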


The Hypergeometric Distribution 


A discrete random quantity $x$ has a hypergeometric distribution with integer
parameters $N$, $M$ and $n$ ($n \leq N + M$) if its probability function $\mathrm{Hy}(x \mid N, M, n)$
is

$$\mathrm{Hy}(x \mid N, M, n) = c \binom{N}{x}\binom{M}{n - x}, \quad \max\{0, n - M\} \leq x \leq \min\{n, N\},$$

where

$$c = \binom{N + M}{n}^{-1}.$$

The mean and variance are given by

$$E[x] = \frac{nN}{N + M} \quad \text{and} \quad V[x] = \frac{nNM}{(N + M)^2}\,\frac{(N + M - n)}{(N + M - 1)}.$$

A mode is attained at the greatest integer $M[x]$ which does not exceed

$$x_m = \frac{(n + 1)(N + 1)}{M + N + 2};$$

if $x_m$ is an integer, then both $x_m$ and $x_m - 1$ are modes.


The Negative-Binomial Distribution


A discrete random quantity $x$ has a negative-binomial distribution with parameters
$\theta$ and $r$ ($0 < \theta < 1$, $r = 1, 2, \ldots$) if its probability function $\mathrm{Nb}(x \mid \theta, r)$ is

$$\mathrm{Nb}(x \mid \theta, r) = c \binom{r + x - 1}{r - 1}(1 - \theta)^x, \quad x = 0, 1, 2, \ldots,$$

where $c = \theta^r$. The mean and variance are $E[x] = r(1 - \theta)/\theta$ and $V[x] = r(1 - \theta)/\theta^2$.
If $r(1 - \theta) > 1$ the mode $M[x]$ is the least integer not less than $[r(1 - \theta) - 1]/\theta$; if
$r(1 - \theta) = 1$, there are two modes at 0 and 1; if $r(1 - \theta) < 1$, $M[x] = 0$.

If $r = 1$, $x$ is said to have a geometric or Pascal distribution. Moreover,
the sum of $k$ independent negative binomial random quantities with parameters
$(\theta, r_i)$, $i = 1, \ldots, k$, is a negative binomial random quantity with parameters $\theta$ and
$r_1 + \cdots + r_k$.


The Poisson Distribution 


A discrete random quantity $x$ has a Poisson distribution with parameter $\lambda$ ($\lambda > 0$)
if its probability function $\mathrm{Pn}(x \mid \lambda)$ is

$$\mathrm{Pn}(x \mid \lambda) = c\,\frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots,$$

where $c = e^{-\lambda}$. The mean and variance are given by $E[x] = V[x] = \lambda$. A mode
$M[x]$ is attained at the greatest integer which does not exceed $\lambda$. If $\lambda$ is an integer,
both $\lambda$ and $\lambda - 1$ are modes.
The sum of $k$ independent Poisson random quantities with parameters $\lambda_i$,
$i = 1, \ldots, k$, is a Poisson random quantity with parameter $\lambda_1 + \cdots + \lambda_k$.
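
The additivity property and the double mode for integer $\lambda$ can be checked by a short computation. The sketch below uses assumed parameter values and truncates infinite sums at a point where the remaining tail is negligible for these values.

```python
from math import exp, factorial

def pn(x, lam):
    """Poisson probability function Pn(x | lam) = exp(-lam) lam^x / x!."""
    return exp(-lam) * lam**x / factorial(x)

lam1, lam2 = 1.5, 2.5          # assumed illustrative parameters
lam = lam1 + lam2              # equals 4, an integer

# Mean and variance from truncated sums.
mean = sum(x * pn(x, lam) for x in range(60))
var = sum((x - mean) ** 2 * pn(x, lam) for x in range(60))

# Convolution of Pn(. | lam1) and Pn(. | lam2) compared with Pn(. | lam1 + lam2).
max_gap = max(
    abs(sum(pn(j, lam1) * pn(s - j, lam2) for j in range(s + 1)) - pn(s, lam))
    for s in range(15)
)
```

Since $\lambda = 4$ is an integer here, `pn(3, lam)` and `pn(4, lam)` coincide, in line with the statement about two modes.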


The Beta Distribution 


A continuous random quantity $x$ has a beta distribution with parameters $\alpha$ and $\beta$
($\alpha > 0$, $\beta > 0$) if its density function $\mathrm{Be}(x \mid \alpha, \beta)$ is

$$\mathrm{Be}(x \mid \alpha, \beta) = c\,x^{\alpha - 1}(1 - x)^{\beta - 1}, \quad 0 < x < 1,$$

where

$$c = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}$$

and $\Gamma(x) = \int_0^\infty t^{x - 1} e^{-t}\,dt$; integer and half-integer values of the gamma function
are easily found from the recursive relation $\Gamma(x + 1) = x\Gamma(x)$, and the values
$\Gamma(1) = 1$ and $\Gamma(1/2) = \sqrt{\pi} \approx 1.7725$.

Systematic application of the beta integral,

$$\int_0^1 x^{\alpha - 1}(1 - x)^{\beta - 1}\,dx = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)},$$

gives

$$E[x] = \frac{\alpha}{\alpha + \beta} \quad \text{and} \quad V[x] = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}.$$

If $\alpha > 1$ and $\beta > 1$, there is a unique mode at $(\alpha - 1)/(\alpha + \beta - 2)$. If $x$ has a
$\mathrm{Be}(x \mid \alpha, \beta)$ density, then $y = 1 - x$ has a $\mathrm{Be}(y \mid \beta, \alpha)$ density. If $\alpha = \beta = 1$, $x$ is
said to have a uniform distribution $\mathrm{Un}(x \mid 0, 1)$ on $(0, 1)$.

By considering the transformed random quantity $y = a + x(b - a)$, where
$x$ has a $\mathrm{Be}(x \mid \alpha, \beta)$ density, the beta distribution can be generalised to any finite
interval $(a, b)$. In particular, the uniform distribution $\mathrm{Un}(y \mid a, b)$ on $(a, b)$,

$$\mathrm{Un}(y \mid a, b) = (b - a)^{-1}, \quad a < y < b,$$

has mean $E[y] = (a + b)/2$ and variance $V[y] = (b - a)^2/12$.
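
The gamma-function recursion and the beta moments can be checked with the standard library alone. The following sketch is illustrative: the parameter values and the midpoint-rule helper are assumptions for this example, and `math.gamma` is used for $\Gamma$.

```python
import math

# Half-integer values via Gamma(x + 1) = x Gamma(x), starting from Gamma(1/2) = sqrt(pi).
g = math.sqrt(math.pi)
for k in range(3):             # builds Gamma(3/2), Gamma(5/2), Gamma(7/2)
    g *= 0.5 + k
gamma_7_2 = g

def be(x, a, b):
    """Beta density Be(x | a, b) on 0 < x < 1."""
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * x ** (a - 1) * (1 - x) ** (b - 1)

def midpoint(f, n=20000):
    """Midpoint rule on (0, 1)."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))

a, b = 2.5, 4.0                # assumed illustrative parameters
total = midpoint(lambda x: be(x, a, b))
mean = midpoint(lambda x: x * be(x, a, b))
var = midpoint(lambda x: (x - mean) ** 2 * be(x, a, b))
```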


The Binomial-Beta Distribution 


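A discrete random quantity $x$ has a binomial-beta distribution with parameters $\alpha$,
$\beta$ and $n$ ($\alpha > 0$, $\beta > 0$, $n = 1, 2, \ldots$) if its probability function $\mathrm{Bb}(x \mid \alpha, \beta, n)$ is

$$\mathrm{Bb}(x \mid \alpha, \beta, n) = c \binom{n}{x}\,\Gamma(\alpha + x)\,\Gamma(\beta + n - x), \quad x = 0, \ldots, n,$$

where

$$c = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)\Gamma(\alpha + \beta + n)}.$$

The distribution is generated by the mixture

$$\mathrm{Bb}(x \mid \alpha, \beta, n) = \int_0^1 \mathrm{Bi}(x \mid \theta, n)\,\mathrm{Be}(\theta \mid \alpha, \beta)\,d\theta.$$

The mean and variance are given by

$$E[x] = n\,\frac{\alpha}{\alpha + \beta} \quad \text{and} \quad V[x] = \frac{n\alpha\beta}{(\alpha + \beta)^2}\,\frac{(\alpha + \beta + n)}{(\alpha + \beta + 1)}.$$

A mode is attained at the greatest integer $M[x]$ which does not exceed

$$x_m = \frac{(n + 1)(\alpha - 1)}{\alpha + \beta - 2};$$

if $x_m$ is an integer, both $x_m$ and $x_m - 1$ are modes. If $\alpha = \beta = 1$ we obtain the
discrete uniform distribution, assigning mass $(n + 1)^{-1}$ to each possible $x$.


The Negative-Binomial-Beta Distribution
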
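A discrete random quantity $x$ has a negative-binomial-beta distribution with
parameters $\alpha$, $\beta$ and $r$ ($\alpha > 0$, $\beta > 0$, $r = 1, 2, \ldots$) if its probability function
$\mathrm{Nbb}(x \mid \alpha, \beta, r)$ is

$$\mathrm{Nbb}(x \mid \alpha, \beta, r) = c \binom{r + x - 1}{r - 1}\frac{\Gamma(\beta + x)}{\Gamma(\alpha + \beta + r + x)}, \quad x = 0, 1, 2, \ldots,$$

where

$$c = \frac{\Gamma(\alpha + \beta)\,\Gamma(\alpha + r)}{\Gamma(\alpha)\Gamma(\beta)}.$$

The distribution is generated by the mixture

$$\mathrm{Nbb}(x \mid \alpha, \beta, r) = \int_0^1 \mathrm{Nb}(x \mid \theta, r)\,\mathrm{Be}(\theta \mid \alpha, \beta)\,d\theta.$$

The mean is $E[x] = r\beta(\alpha - 1)^{-1}$, $\alpha > 1$, and the variance is given by

$$V[x] = \frac{r\beta}{\alpha - 1}\left[\frac{\alpha + \beta + r - 1}{\alpha - 2} + \frac{r\beta}{(\alpha - 1)(\alpha - 2)}\right], \quad \alpha > 2.$$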


The Gamma Distribution 


A continuous random quantity $x$ has a gamma distribution with parameters $\alpha$ and $\beta$
($\alpha > 0$, $\beta > 0$) if its density function $\mathrm{Ga}(x \mid \alpha, \beta)$ is

$$\mathrm{Ga}(x \mid \alpha, \beta) = c\,x^{\alpha - 1} e^{-\beta x}, \quad x > 0,$$

where $c = \beta^\alpha/\Gamma(\alpha)$. Systematic application of the gamma integral

$$\int_0^\infty x^{\alpha - 1} e^{-\beta x}\,dx = \frac{\Gamma(\alpha)}{\beta^\alpha}$$

gives $E[x] = \alpha/\beta$ and $V[x] = \alpha/\beta^2$. If $\alpha > 1$, there is a unique mode at $(\alpha - 1)/\beta$;
if $\alpha < 1$ there are no modes (the density is unbounded).
If $\alpha = 1$, $x$ is said to have an exponential $\mathrm{Ex}(x \mid \beta)$ distribution with parameter
$\beta$ and density

$$\mathrm{Ex}(x \mid \beta) = \beta e^{-\beta x}, \quad x > 0.$$

The mode of an exponential distribution is located at zero. If $\beta = 1$, $x$ is said to
have an Erlang distribution with parameter $\alpha$. If $\alpha = \nu/2$, $\beta = 1/2$, $x$ is said to
have a (central) chi-squared ($\chi^2$) distribution with parameter $\nu$ (often referred to
as degrees of freedom) and density denoted by $\chi^2(x \mid \nu)$ or $\chi^2_\nu$.

By considering the transformed random quantities $y = a + x$ or $y = b - x$,
where $x$ has a $\mathrm{Ga}(x \mid \alpha, \beta)$ density, the gamma distribution can be generalised to the
ranges $(a, \infty)$ or $(-\infty, b)$. Moreover, the sum of $k$ independent gamma random
quantities with parameters $(\alpha_i, \beta)$, $i = 1, \ldots, k$, is a gamma random quantity with
parameters $\alpha_1 + \cdots + \alpha_k$ and $\beta$.


The Inverted-Gamma Distribution


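A continuous random quantity $x$ has an inverted-gamma distribution with parameters
$\alpha$ and $\beta$ ($\alpha > 0$, $\beta > 0$) if its density function $\mathrm{Ig}(x \mid \alpha, \beta)$ is

$$\mathrm{Ig}(x \mid \alpha, \beta) = c\,x^{-(\alpha + 1)} e^{-\beta/x}, \quad x > 0,$$

where $c = \beta^\alpha/\Gamma(\alpha)$. Systematic application of the gamma integral gives

$$E[x] = \frac{\beta}{\alpha - 1}, \quad \alpha > 1,$$

$$V[x] = \frac{\beta^2}{(\alpha - 1)^2(\alpha - 2)}, \quad \alpha > 2.$$

There is a unique mode at $\beta/(\alpha + 1)$. The term inverted-gamma derives from the
easily established fact that if $y$ has a $\mathrm{Ga}(y \mid \alpha, \beta)$ density then $x = y^{-1}$ has an
$\mathrm{Ig}(x \mid \alpha, \beta)$ density.

If $x$ has an inverted-gamma distribution with $\alpha = \nu/2$, $\beta = 1/2$, then $x$ is
said to have an inverted-$\chi^2$ distribution.

A continuous random quantity $y$ has a square-root inverted-gamma density,
$\mathrm{Ga}^{-1/2}(y \mid \alpha, \beta)$, if $x = y^{-2}$ has a $\mathrm{Ga}(x \mid \alpha, \beta)$ density.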


The Poisson-Gamma Distribution 


A discrete random quantity $x$ has a Poisson-gamma distribution with parameters
$\alpha$, $\beta$ and $\nu$ ($\alpha > 0$, $\beta > 0$, $\nu > 0$) if its probability function $\mathrm{Pg}(x \mid \alpha, \beta, \nu)$ is

$$\mathrm{Pg}(x \mid \alpha, \beta, \nu) = c\,\frac{\Gamma(\alpha + x)}{x!}\,\frac{\nu^x}{(\beta + \nu)^{\alpha + x}}, \quad x = 0, 1, 2, \ldots,$$

where $c = \beta^\alpha/\Gamma(\alpha)$. The distribution is generated by the mixture

$$\mathrm{Pg}(x \mid \alpha, \beta, \nu) = \int_0^\infty \mathrm{Pn}(x \mid \nu\lambda)\,\mathrm{Ga}(\lambda \mid \alpha, \beta)\,d\lambda.$$

This compound Poisson distribution is, in fact, a generalisation of the negative
binomial distribution $\mathrm{Nb}(x \mid \beta/(\beta + \nu), \alpha)$, previously defined only for integer $\alpha$.
The mean is $E[x] = \nu\alpha/\beta$, and the variance is $V[x] = \nu\alpha(\beta + \nu)/\beta^2$. Moreover,
if $\alpha\nu > \beta + \nu$, there is a mode at the least integer not less than $(\nu(\alpha - 1)/\beta) - 1$;
if $\alpha\nu = \beta + \nu$, there are two modes at 0 and 1; if $\alpha\nu < \beta + \nu$, $M[x] = 0$.


The Gamma-Gamma Distribution


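A continuous random quantity $x$ has a gamma-gamma distribution with parameters
$\alpha$, $\beta$ and $n$ ($\alpha > 0$, $\beta > 0$, $n = 1, 2, \ldots$) if its density function $\mathrm{Gg}(x \mid \alpha, \beta, n)$
is

$$\mathrm{Gg}(x \mid \alpha, \beta, n) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,\frac{\Gamma(\alpha + n)}{\Gamma(n)}\,\frac{x^{n - 1}}{(\beta + x)^{\alpha + n}}, \quad x > 0.$$

The distribution is generated by the mixture

$$\mathrm{Gg}(x \mid \alpha, \beta, n) = \int_0^\infty \mathrm{Ga}(x \mid n, \lambda)\,\mathrm{Ga}(\lambda \mid \alpha, \beta)\,d\lambda.$$

The mean and variance are given by

$$E[x] = n\,\frac{\beta}{\alpha - 1}, \quad \alpha > 1,$$

$$V[x] = \frac{\beta^2\,(n^2 + n(\alpha - 1))}{(\alpha - 1)^2(\alpha - 2)}, \quad \alpha > 2.$$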


The Pareto Distribution 


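A continuous random quantity $x$ has a Pareto distribution with parameters $\alpha$ and
$\beta$ ($\alpha > 0$, $\beta > 0$) if its density function $\mathrm{Pa}(x \mid \alpha, \beta)$ is

$$\mathrm{Pa}(x \mid \alpha, \beta) = c\,x^{-(\alpha + 1)}, \quad x \geq \beta,$$

where $c = \alpha\beta^\alpha$. The mean and variance are given by

$$E[x] = \frac{\alpha\beta}{\alpha - 1}, \quad \text{if } \alpha > 1,$$

$$V[x] = \frac{\alpha\beta^2}{(\alpha - 1)^2(\alpha - 2)}, \quad \text{if } \alpha > 2.$$

The mode is $M[x] = \beta$. The distribution is generated by the mixture

$$\mathrm{Pa}(x \mid \alpha, \beta) = \int_0^\infty \mathrm{Ex}(x - \beta \mid \theta)\,\mathrm{Ga}(\theta \mid \alpha, \beta)\,d\theta.$$

A continuous random quantity $y$ has an inverted-Pareto density $\mathrm{Ip}(y \mid \alpha, \beta)$ if
$x = y^{-1}$ has a $\mathrm{Pa}(x \mid \alpha, \beta)$ density.


The Normal Distribution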


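A continuous random quantity $x$ has a normal distribution with parameters $\mu$ and $\lambda$
($\mu \in \Re$, $\lambda > 0$) if its density function $\mathrm{N}(x \mid \mu, \lambda)$ is

$$\mathrm{N}(x \mid \mu, \lambda) = c \exp\left\{-\tfrac{1}{2}\lambda(x - \mu)^2\right\}, \quad x \in \Re,$$

where

$$c = \lambda^{1/2}(2\pi)^{-1/2}.$$

The distribution is symmetrical about $x = \mu$. The mean and mode are $E[x] =
M[x] = \mu$ and the variance is $V[x] = \lambda^{-1}$, so that $\lambda$ here represents the precision
of the distribution. Alternatively, $\mathrm{N}(x \mid \mu, \lambda)$ is denoted by $\mathrm{N}(x \mid \mu, \sigma^{-2})$, where
$\sigma^2 = V[x]$ is the variance. If $\mu = 0$, $\lambda = 1$, $x$ is said to have a standard normal
distribution, with distribution function $\Phi$ given by

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left\{-\tfrac{1}{2}t^2\right\} dt.$$

If $y = \lambda^{1/2}(x - \mu) = (x - \mu)/\sigma$, where $x$ has a normal density $\mathrm{N}(x \mid \mu, \lambda)$, then
$y$ has a $\mathrm{N}(y \mid 0, 1)$ (standard) density. In general, if $y = a + \sum_{i=1}^k b_i x_i$, where
the $x_i$ are independent with $\mathrm{N}(x_i \mid \mu_i, \lambda_i)$ densities, then $y$ has a normal density,
$\mathrm{N}(y \mid a + \sum_{i=1}^k b_i \mu_i, \lambda)$, where $\lambda = (\sum_{i=1}^k b_i^2/\lambda_i)^{-1}$, a weighted harmonic mean
of the individual precisions.

If $x_1, \ldots, x_\nu$ are mutually independent standard normal random quantities,
then $z = \sum_{i=1}^\nu x_i^2$ has a (central) $\chi^2_\nu$ distribution.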
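
The precision of a linear combination can be checked by simulation. The sketch below is illustrative only: the coefficients, means and precisions are assumed values, and the comparison with sample moments is subject to Monte Carlo error. Note that `random.gauss` is parameterised by a standard deviation, so each precision $\lambda_i$ is converted to $\lambda_i^{-1/2}$.

```python
import random
import statistics

random.seed(1)

a = 2.0
b = [1.0, -2.0, 0.5]          # assumed coefficients b_i
mu = [0.0, 1.0, -1.0]         # assumed means mu_i
lam = [1.0, 4.0, 0.25]        # assumed precisions lambda_i

# Parameters of y = a + sum_i b_i x_i according to the stated result.
mean_y = a + sum(bi * mi for bi, mi in zip(b, mu))
lam_y = 1.0 / sum(bi**2 / li for bi, li in zip(b, lam))

# Simulated draws of y from independent normal x_i.
draws = [
    a + sum(bi * random.gauss(mi, li ** -0.5) for bi, mi, li in zip(b, mu, lam))
    for _ in range(50_000)
]
sample_mean = statistics.fmean(draws)
sample_precision = 1.0 / statistics.variance(draws)
```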


The Non-central $\chi^2$ Distribution

A continuous random quantity $x$ has a non-central $\chi^2$ distribution with parameters
$\nu$ (degrees of freedom) and $\lambda$ (non-centrality) ($\nu > 0$, $\lambda > 0$) if its density function
$\chi^2(x \mid \nu, \lambda)$ is

$$\chi^2(x \mid \nu, \lambda) = \sum_{i=0}^{\infty} \mathrm{Pn}(i \mid \lambda/2)\,\chi^2(x \mid \nu + 2i),$$

i.e., a mixture of central $\chi^2$ distributions with Poisson weights. It reduces to
a central $\chi^2(\nu)$ when $\lambda = 0$. The mean and variance are $E[x] = \nu + \lambda$ and
$V[x] = 2(\nu + 2\lambda)$. The distribution is unimodal; the mode occurs at the value
$M[x]$ such that $\chi^2(M[x] \mid \nu, \lambda) = \chi^2(M[x] \mid \nu - 2, \lambda)$.

If $x_1, \ldots, x_\nu$ are mutually independent normal random quantities $\mathrm{N}(x_i \mid \mu_i, 1)$,
then $z = \sum_{i=1}^\nu x_i^2$ has a non-central $\chi^2$ distribution,

$$\chi^2\Big(z \,\Big|\, \nu, \sum_{i=1}^\nu \mu_i^2\Big).$$

The sum of $k$ independent non-central $\chi^2$ distributions with parameters $(\nu_i, \lambda_i)$
is a non-central $\chi^2$ with parameters $\nu_1 + \cdots + \nu_k$ and $\lambda_1 + \cdots + \lambda_k$.




The Logistic Distribution 
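A continuous random quantity $x$ has a logistic distribution with parameters $\alpha$ and $\beta$
($\alpha \in \Re$, $\beta > 0$) if its density function $\mathrm{Lo}(x \mid \alpha, \beta)$ is

$$\mathrm{Lo}(x \mid \alpha, \beta) = c\,\frac{\exp\{-(x - \alpha)/\beta\}}{\left[1 + \exp\{-(x - \alpha)/\beta\}\right]^2}, \quad x \in \Re,$$

where $c = \beta^{-1}$. An alternative expression for the density function is

$$\frac{1}{4\beta}\,\mathrm{sech}^2\left\{\frac{1}{2}\left(\frac{x - \alpha}{\beta}\right)\right\},$$

so that the logistic is sometimes called the sech-squared distribution.
The logistic distribution is most simply expressed in terms of its distribution
function,

$$F_x(x) = \left[1 + \exp\left\{-\left(\frac{x - \alpha}{\beta}\right)\right\}\right]^{-1}.$$

The distribution is symmetrical about $x = \alpha$. The mean and mode are given by
$E[x] = M[x] = \alpha$, and the variance is $V[x] = \beta^2\pi^2/3$.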


The Student (t) Distribution 
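A continuous random quantity $x$ has a Student distribution with parameters $\mu$, $\lambda$
and $\alpha$ ($\mu \in \Re$, $\lambda > 0$, $\alpha > 0$) if its density $\mathrm{St}(x \mid \mu, \lambda, \alpha)$ is

$$\mathrm{St}(x \mid \mu, \lambda, \alpha) = c\left[1 + \frac{\lambda}{\alpha}(x - \mu)^2\right]^{-(\alpha + 1)/2}, \quad x \in \Re,$$

where

$$c = \frac{\Gamma((\alpha + 1)/2)}{\Gamma(\alpha/2)\,\Gamma(1/2)}\left(\frac{\lambda}{\alpha}\right)^{1/2}.$$

The distribution is symmetrical about $x = \mu$, and has a unique mode $M[x] = \mu$.
The mean and variance are

$$E[x] = \mu, \quad \text{if } \alpha > 1,$$

$$V[x] = \lambda^{-1}\frac{\alpha}{\alpha - 2}, \quad \text{if } \alpha > 2.$$

The parameter $\alpha$ is usually referred to as the degrees of freedom of the distribution.
The distribution is generated (Dickey, 1968) by the mixture

$$\mathrm{St}(x \mid \mu, \lambda, \alpha) = \int_0^\infty \mathrm{N}(x \mid \mu, \lambda y)\,\mathrm{Ga}\left(y \,\Big|\, \frac{\alpha}{2}, \frac{\alpha}{2}\right) dy$$

and includes the normal distribution as a limiting case, since

$$\mathrm{N}(x \mid \mu, \lambda) = \lim_{\alpha \to \infty} \mathrm{St}(x \mid \mu, \lambda, \alpha).$$

If $y = \lambda^{1/2}(x - \mu)$, where $x$ has a $\mathrm{St}(x \mid \mu, \lambda, \alpha)$ density, then $y$ has a (standard)
Student density $\mathrm{St}(y \mid 0, 1, \alpha)$. If $\alpha = 1$, $x$ is said to have a Cauchy distribution,
with density $\mathrm{Ca}(x \mid \mu, \lambda)$.

If $x$ has a standard normal distribution, $y$ has a $\chi^2_\nu$ distribution, and $x$ and $y$
are mutually independent, then

$$z = \frac{x}{(y/\nu)^{1/2}}$$

has a standard Student density $\mathrm{St}(z \mid 0, 1, \nu)$.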


The Snedecor (F) Distribution 
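A continuous random quantity $x$ has a Snedecor, or Fisher, distribution with
parameters $\alpha$ and $\beta$ (degrees of freedom) ($\alpha > 0$, $\beta > 0$) if its density $\mathrm{Fs}(x \mid \alpha, \beta)$
is

$$\mathrm{Fs}(x \mid \alpha, \beta) = c\,\frac{x^{(\alpha/2) - 1}}{(\beta + \alpha x)^{(\alpha + \beta)/2}}, \quad x > 0,$$

where

$$c = \frac{\Gamma((\alpha + \beta)/2)}{\Gamma(\alpha/2)\,\Gamma(\beta/2)}\,\alpha^{\alpha/2}\beta^{\beta/2}.$$

If $\beta > 2$, $E[x] = \beta/(\beta - 2)$ and there is a unique mode at $[\beta/(\beta + 2)][(\alpha - 2)/\alpha]$;
moreover, if $\beta > 4$,

$$V[x] = \frac{2\beta^2(\alpha + \beta - 2)}{\alpha(\beta - 2)^2(\beta - 4)}.$$

If $x$ and $y$ are independent random quantities with central $\chi^2$ distributions,
with, respectively, $\nu_1$ and $\nu_2$ degrees of freedom, then

$$z = \frac{(x/\nu_1)}{(y/\nu_2)}$$

has a Snedecor distribution with $\nu_1$ and $\nu_2$ degrees of freedom.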


Relationships between some of the distributions described above can be
established using the techniques described in Section 3.2.1. For a geometrical
interpretation of some of these relations, see Bailey (1992).

Example 3.1. (Gamma and $\chi^2$ distributions). Suppose that $x$ has a $\mathrm{Ga}(x \mid \alpha, \beta)$
density and let $y = 2\beta x$. Then, for $y > 0$, we have

$$p_y(y) = p_x\left(\frac{y}{2\beta}\right)\frac{\partial (y/2\beta)}{\partial y} = \frac{1}{\Gamma(\alpha)}\left(\frac{1}{2}\right)^{\alpha} y^{\alpha - 1}\exp\left\{-\frac{y}{2}\right\} = \mathrm{Ga}\left(y \,\Big|\, \alpha, \tfrac{1}{2}\right),$$

so that $y$ has a $\chi^2(y \mid 2\alpha)$ density. Since $\chi^2$ distributions are extensively tabulated, this
relationship provides a useful basis for numerical work involving any gamma distribution.
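
The identity in Example 3.1 can be confirmed pointwise. In the sketch below the parameter values and evaluation points are arbitrary choices made for illustration, and the function names `ga` and `chi2` simply mirror the notation of the text.

```python
import math

def ga(x, a, b):
    """Gamma density Ga(x | a, b) = b^a / Gamma(a) * x^(a-1) * exp(-b x)."""
    return b**a / math.gamma(a) * x ** (a - 1) * math.exp(-b * x)

def chi2(y, nu):
    """Central chi-squared density, defined as Ga(y | nu/2, 1/2)."""
    return ga(y, nu / 2, 0.5)

a, b = 1.7, 3.2               # assumed illustrative parameters
for y in (0.1, 1.0, 2.5, 7.0):
    # p_x(g^{-1}(y)) times the derivative of g^{-1}(y) = y / (2b)
    transformed = ga(y / (2 * b), a, b) / (2 * b)
    assert math.isclose(transformed, chi2(y, 2 * a), rel_tol=1e-12)
```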


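Example 3.2. (Beta, Binomial and Snedecor (F) distributions). Suppose that $x$ has
a density $\mathrm{Be}(x \mid \alpha, \beta)$ and let $y = \beta x[\alpha(1 - x)]^{-1}$. Then, for $y > 0$, and noting that
$x = \alpha y[\beta + \alpha y]^{-1}$, we have

$$p_y(y) = p_x\left(\alpha y[\beta + \alpha y]^{-1}\right)\frac{\partial\{\alpha y[\beta + \alpha y]^{-1}\}}{\partial y}$$

$$= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\left(\frac{\alpha y}{\beta + \alpha y}\right)^{\alpha - 1}\left(\frac{\beta}{\beta + \alpha y}\right)^{\beta - 1}\frac{\alpha\beta}{(\beta + \alpha y)^2}$$

$$= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\alpha^\alpha\beta^\beta\,\frac{y^{\alpha - 1}}{(\beta + \alpha y)^{\alpha + \beta}} = \mathrm{Fs}(y \mid 2\alpha, 2\beta),$$

so that $y$ has the stated Snedecor (F) density. Binomial probabilities may also be obtained
from the F distribution using the exact relation between their distribution functions given
by Peizer and Pratt (1968),

$$F_{\mathrm{Bi}}(x \mid \theta, n) = F_{\mathrm{Fs}}\left(\frac{(x + 1)(1 - \theta)}{\theta(n - x)} \,\Big|\, 2(n - x), 2(x + 1)\right), \quad x = 0, \ldots, n - 1.$$

Since F distributions are extensively tabulated, these relationships provide a useful basis for
numerical work involving any beta or binomial distribution.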
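
The relation between binomial and F distribution functions can be checked numerically. The following sketch is illustrative: it integrates the Snedecor density by a midpoint rule instead of using tables, takes $x = 0, \ldots, n - 1$ so that the argument is finite, and uses assumed values $\theta = 0.3$, $n = 8$.

```python
import math

def fs(y, a, b):
    """Snedecor density Fs(y | a, b), computed on the log scale for stability."""
    logc = (math.lgamma((a + b) / 2) - math.lgamma(a / 2) - math.lgamma(b / 2)
            + (a / 2) * math.log(a) + (b / 2) * math.log(b))
    return math.exp(logc + (a / 2 - 1) * math.log(y) - ((a + b) / 2) * math.log(b + a * y))

def fs_cdf(q, a, b, m=20000):
    """Distribution function of Fs(. | a, b) at q by the midpoint rule on (0, q)."""
    h = q / m
    return h * sum(fs((i + 0.5) * h, a, b) for i in range(m))

def bi_cdf(x, theta, n):
    """Binomial distribution function at x."""
    return sum(math.comb(n, j) * theta**j * (1 - theta) ** (n - j) for j in range(x + 1))

theta, n = 0.3, 8             # assumed illustrative values
pairs = []
for x in range(n):
    q = (x + 1) * (1 - theta) / (theta * (n - x))
    pairs.append((bi_cdf(x, theta, n), fs_cdf(q, 2 * (n - x), 2 * (x + 1))))
```

Each pair in `pairs` holds the binomial and the F distribution function values, which should agree up to integration error.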


Example 3.3. (Approximate moments for transformed random quantities). Suppose
that $x$ has a $\mathrm{Be}(x \mid \alpha, \beta)$ density, $0 < x < 1$, but that we are interested in the means and
variances of the transformed random quantities

$$y_1 = g_1(x) = \log\left(\frac{x}{1 - x}\right), \quad y_2 = g_2(x) = \sin^{-1}\sqrt{x}.$$

Recalling that

$$E[x] = \mu = \frac{\alpha}{\alpha + \beta}, \quad V[x] = \sigma^2 = \frac{\mu(1 - \mu)}{\alpha + \beta + 1},$$

and noting that

$$g_1'(x) = \frac{1}{x(1 - x)}, \quad g_1''(x) = \frac{x - (1 - x)}{x^2(1 - x)^2},$$

$$g_2'(x) = \frac{1}{2}\,x^{-1/2}(1 - x)^{-1/2}, \quad g_2''(x) = \frac{1}{4}\,\frac{x - (1 - x)}{[x(1 - x)]^{3/2}},$$

application of Proposition 3.3 immediately yields the following approximations:

$$E[y_1] \approx \log\left(\frac{\mu}{1 - \mu}\right) + \frac{1}{2}\sigma^2\,\frac{\mu - (1 - \mu)}{\mu^2(1 - \mu)^2},$$

$$V[y_1] \approx \frac{\sigma^2}{\mu^2(1 - \mu)^2},$$

$$E[y_2] \approx \sin^{-1}\sqrt{\mu} + \frac{1}{8}\sigma^2\,\frac{\mu - (1 - \mu)}{[\mu(1 - \mu)]^{3/2}},$$

$$V[y_2] \approx \frac{\sigma^2}{4\mu(1 - \mu)} = \frac{1}{4(\alpha + \beta + 1)}.$$

We note, in particular, that if $\mu \approx \frac{1}{2}$ the second (correction) terms in the mean approximations
will be small, and that, for all $\mu$, the variance is “stabilised” (i.e., does not depend on $\mu$)
under the second transformation.
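
The variance-stabilising effect of the second transformation can be seen in a small simulation. This sketch is not from the original text: the three parameter pairs (all with $\alpha + \beta = 20$), the seed and the sample size are assumptions, and agreement is only approximate because Proposition 3.3 ignores higher order terms.

```python
import math
import random
import statistics

random.seed(0)
N = 50_000

results = {}
for alpha, beta in [(4.0, 16.0), (10.0, 10.0), (16.0, 4.0)]:
    # Draw x from Be(x | alpha, beta) and apply y2 = arcsin(sqrt(x)).
    ys = [math.asin(math.sqrt(random.betavariate(alpha, beta))) for _ in range(N)]
    m = statistics.fmean(ys)
    mc_var = sum((y - m) ** 2 for y in ys) / (N - 1)
    approx_var = 1 / (4 * (alpha + beta + 1))    # same value for all three pairs
    results[(alpha, beta)] = (mc_var, approx_var)
```

The simulated variances of $y_2$ should be similar across the three cases, even though the variance of $x$ itself changes with $\mu$.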


3.2.3. Convergence and Limit Theorems 


Within the countably additive framework for probability which we are currently re- 
viewing, much of the powerful resulting mathematical machinery rests on various 
notions of limit process. We shall summarise a few of the main ideas and re- 
sults, beginning with the four most widely used notions of convergence for random 
quantities. 
Definition 3.8. (Convergence). A sequence $x_1, x_2, \ldots$ of random quantities:
(i) converges in mean square to a random quantity $x$ if and only if

$$\lim_{i \to \infty} E[(x_i - x)^2] = 0;$$

(ii) converges almost surely to a random quantity $x$ if and only if

$$P\left(\{\omega : \lim_{i \to \infty} x_i(\omega) = x(\omega)\}\right) = 1;$$

in other words, if $x_i(\omega)$ tends to $x(\omega)$ for all $\omega$ except those lying in a
set of $P$-measure zero;

(iii) converges in probability to a random quantity $x$ if and only if

$$\text{for all } \varepsilon > 0, \quad \lim_{i \to \infty} P(\{\omega : |x_i(\omega) - x(\omega)| > \varepsilon\}) = 0;$$

(iv) converges in distribution to a random quantity $x$ if and only if the
corresponding distribution functions are such that

$$\lim_{i \to \infty} F_i(t) = F(t)$$

at all continuity points $t$ of $F$ in $\Re$; we denote this by $F_i \to F$.

Convergence in mean square implies convergence in probability; for finite
random quantities, almost sure convergence also implies convergence in probability.
Convergence in probability implies convergence in distribution; the converse is
false. Convergence in distribution is completely determined by the distribution
functions: the corresponding random quantities need not be defined on the same
probability space. Moreover,

(i) $F_i \to F$ if and only if, for every bounded continuous function $g$, the sequence
of the expected values $E[g(x_i)]$ with respect to $F_i$ converges to the expected
value $E[g(x)]$ with respect to $F$;

(ii) if $F_i \to F$ and $\phi_i(t)$ and $\phi(t)$ are the corresponding characteristic functions,
then $\phi_i(t) \to \phi(t)$ for all $t \in \Re$; the converse also holds, provided that $\phi(t)$ is
continuous at $t = 0$;

(iii) (Helly's theorem) Given a sequence $\{F_1, F_2, \ldots\}$ of distribution functions
such that for each $\varepsilon > 0$ there exists an $a$ such that for all $i$ sufficiently
large $F_i(a) - F_i(-a) > 1 - \varepsilon$, there exists a distribution function $F$ and a
subsequence $F_{i_1}, F_{i_2}, \ldots$ such that $F_{i_j} \to F$.


An important class of limit results, the so-called laws of large numbers, link
the limiting behaviour of averages of (independent) random quantities with their
expectations. Some of the most basic of these are the following:

(i) If $x_1, x_2, \ldots$ are independent, identically distributed random quantities with
$V[x_i] < \infty$ and $E[x_i] = \mu$, then the sequence of random quantities
$\bar{x}_n = n^{-1}\sum_{i=1}^{n} x_i$, $n = 1, 2, \ldots$, converges in mean square (and hence in
probability) to $\mu$; that is to say, to a degenerate, discrete random quantity which
assigns probability one to $\mu$.

(ii) The weak law of large numbers. If $x_1, x_2, \ldots$ are independent, identically
distributed random quantities with $E[x_i] = \mu < \infty$, then the sequence of
random quantities $\bar{x}_n$, $n = 1, 2, \ldots$, converges in probability to $\mu$.

(iii) The strong law of large numbers. Under the same conditions as in (ii), $\bar{x}_n$,
$n = 1, 2, \ldots$, converges almost surely to $\mu$.
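
As a numerical illustration of the laws of large numbers, the following minimal sketch tracks the sample mean of independent uniform draws on $(0, 1)$, for which $\mu = 0.5$. The helper name `running_means`, the seed and the checkpoints are arbitrary choices made for this example; a single simulated path only suggests, and cannot prove, convergence.

```python
import random

def running_means(draw, n, checkpoints):
    """Return {m: mean of the first m draws} for each m in checkpoints."""
    wanted = set(checkpoints)
    total, out = 0.0, {}
    for m in range(1, n + 1):
        total += draw()
        if m in wanted:
            out[m] = total / m
    return out

rng = random.Random(0)  # fixed seed, chosen only for repeatability
means = running_means(rng.random, 100_000, [10, 1_000, 100_000])
for m, xbar in means.items():
    # the absolute error should typically shrink as m grows
    print(m, round(xbar, 4), "abs. error:", round(abs(xbar - 0.5), 4))
```

The error at the largest checkpoint is typically of the order $\sigma/\sqrt{n}$, which is the scale made precise by the central limit theorem.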


In addition, there is a further class of limit results which characterises in more 
detail the properties of the distance between the sequence and the limit values. Two 
important examples are the following: 

(i) The central limit theorem. If $x_1, x_2, \ldots$ are independent identically distributed
random quantities with $E[x_i] = \mu$ and $V[x_i] = \sigma^2 < \infty$, for all $i$, then the
sequence of standardised random quantities

$$\frac{\bar{x}_n - \mu}{\sigma/\sqrt{n}}, \qquad n = 1, 2, \ldots$$

converges in distribution to the standard normal distribution.

(ii) The law of the iterated logarithm. Under the conditions assumed for the central
limit theorem,

$$\limsup_{n\to\infty} \frac{\bar{x}_n - \mu}{\sigma/\sqrt{n}}\,(2 \log\log n)^{-1/2} = 1.$$


There are enormously wide-ranging variations and generalisations of these 
results, but we shall rarely need to go beyond the above in our subsequent discussion. 
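
The central limit theorem can be checked informally by simulation. The sketch below standardises means of exponential draws (for which $\mu = \sigma = 1$) and counts how often the result falls within $\pm 1.96$; the distribution, sample sizes and seed are assumptions made purely for illustration.

```python
import math
import random

def standardised_mean(rng, n, mu=1.0, sigma=1.0):
    """(xbar_n - mu) / (sigma / sqrt(n)) for n unit-rate exponential draws."""
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

rng = random.Random(1)
z = [standardised_mean(rng, 200) for _ in range(2000)]
coverage = sum(abs(v) <= 1.96 for v in z) / len(z)
# A standard normal limit suggests a value near 0.95.
print("fraction within +/-1.96:", coverage)
```

With only 2000 replications the observed fraction carries Monte Carlo error of roughly half a percentage point, so small departures from 0.95 are to be expected.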


3.2.4 Random Vectors, Bayes’ Theorem 


A random quantity represents a numerical summary of the potential outcomes 
in an uncertain situation. However, in general each outcome has many different 
numerical summaries which may be associated with it. For example, a description 
of the state of the international commodity market would typically involve a whole 
complex of price information; the birth of an infant might be recorded in terms of 
weight and heart-rate measurements, as well as an encoding (for example, using a 0- 
1 convention) of its sex. It is necessary therefore to have available the mathematical 
apparatus for handling a vector of numerical information. 
Formally, we wish to define a mapping 


$$\boldsymbol{x} : \Omega \to X \subseteq \Re^k$$


which associates a vector $\boldsymbol{x}(\omega)$ of $k$ real numbers with each elementary outcome
$\omega$ of $\Omega$. As in the case of (univariate) random quantities, we move the focus of
attention from the underlying probability space $\{\Omega, \mathcal{F}, P\}$ to the context of $\Re^k$ and
an induced probability measure $P_x$. However, we shall again wish to ensure that $P_x$
is well-defined for particular subsets of $\Re^k$ and this puts mathematical constraints
on the form of the function $\boldsymbol{x}$. Generalising our earlier discussion given in Section
3.2.1, we shall take this class of subsets to be the smallest $\sigma$-algebra, $\mathcal{B}$, containing
all forms of $k$-dimensional interval (the so-called Borel sets of $\Re^k$). This then
prompts the following definition.


Definition 3.9. (Random vector). A random vector $\boldsymbol{x}$ on a probability space
$\{\Omega, \mathcal{F}, P\}$ is a function $\boldsymbol{x} : \Omega \to X \subseteq \Re^k$ such that

$$\boldsymbol{x}^{-1}(B) \in \mathcal{F}, \quad \text{for all } B \in \mathcal{B}.$$


For a random vector $\boldsymbol{x}$, the induced probability measure $P_x$ is defined in the
natural way by

$$P_x(B) = P(\boldsymbol{x}^{-1}(B)), \quad B \in \mathcal{B}.$$


The possible forms of distribution for $\boldsymbol{x}$, $P_x$, are potentially much more
complicated for a random vector than in the case of a single random quantity, in that they
not only describe the uncertainty about each of the individual component random
quantities in the vector, but also the dependencies among them.

As in the one-dimensional case, we can distinguish discrete distributions,
where $\boldsymbol{x}$ takes only a countable number of possible values and the distribution can
be described by the probability (mass) function

$$p_x(\boldsymbol{x}) = P(\{\omega;\ \boldsymbol{x}(\omega) = \boldsymbol{x}\}),$$

and (absolutely) continuous distributions, where the distribution may be described
by a density function $p_x(\boldsymbol{x})$ such that

$$P_x(B) = \int_B p_x(\boldsymbol{x})\, d\boldsymbol{x} = \int_B p_x(x_1, \ldots, x_k)\, dx_1 \cdots dx_k, \quad B \in \mathcal{B}.$$

The distribution function of a random vector $\boldsymbol{x}$ is the real-valued function
$F_x : \Re^k \to [0, 1]$ defined by

$$F_x(\boldsymbol{x}) = F_x(x_1, \ldots, x_k) = P_x\{(-\infty, x_1] \times \cdots \times (-\infty, x_k]\}.$$


In addition, we could have cases where some of the components are discrete
and others are continuous. Some components, of course, might themselves be a
mixture of the two types. In what follows, we shall usually present our discussion 
using the notation for the continuous case. It will always be clear from the context 
how to reinterpret things in the discrete (or mixed) cases. 


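The density $p_x(\boldsymbol{x}) = p_x(x_1, \ldots, x_k)$ of the random vector $\boldsymbol{x}$ is often referred
to as the joint density of the random quantities $x_1, \ldots, x_k$. If the random vector $\boldsymbol{x}$
is partitioned into $\boldsymbol{x} = (\boldsymbol{y}, \boldsymbol{z})$, say, where $\boldsymbol{y} = (x_1, \ldots, x_t)$, $\boldsymbol{z} = (x_{t+1}, \ldots, x_k)$,
the marginal density for the vector $\boldsymbol{y}$ is given by

$$p_y(\boldsymbol{y}) = \int_{\Re^{k-t}} p_x(\boldsymbol{y}, \boldsymbol{z})\, d\boldsymbol{z},$$

or alternatively, dropping the subscripts without danger of confusion,

$$p(x_1, \ldots, x_t) = \int_{\Re^{k-t}} p(x_1, \ldots, x_k)\, dx_{t+1} \cdots dx_k.$$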


This operation of passing from a joint to a marginal density occurs so commonly
in Bayesian inference contexts (see Chapter 5, in particular) that it is useful to
have available a simple alternative notation, emphasising the operation itself, rather
than the technical integration required. To denote the marginalisation operation
we shall therefore write

$$p_x(\boldsymbol{y}, \boldsymbol{z}) \longrightarrow p_y(\boldsymbol{y}).$$


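The conditional density for the random vector $\boldsymbol{z}$, given that $\boldsymbol{y}(\omega) = \boldsymbol{y}$, is
defined by

$$p_{z|y}(\boldsymbol{z} \mid \boldsymbol{y}) = \frac{p_x(\boldsymbol{y}, \boldsymbol{z})}{p_y(\boldsymbol{y})},$$

or, alternatively, again dropping the subscripts for convenience,

$$p(x_{t+1}, \ldots, x_k \mid x_1, \ldots, x_t) = \frac{p(x_1, \ldots, x_k)}{p(x_1, \ldots, x_t)}.$$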


We shall almost always use the generic subscript-free notation for densities. It is 
therefore important to remember that the functional forms of the various marginal 
and conditional densities will typically differ. 


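Proposition 3.4. (Generalised Bayes' theorem).

$$p_{y|z}(\boldsymbol{y} \mid \boldsymbol{z}) = \frac{p_{z|y}(\boldsymbol{z} \mid \boldsymbol{y})\, p_y(\boldsymbol{y})}{p_z(\boldsymbol{z})}.$$


Proof. Exchanging the roles of $\boldsymbol{y}$ and $\boldsymbol{z}$ in the above, it is obvious that

$$p_x(\boldsymbol{x}) = p_{z|y}(\boldsymbol{z} \mid \boldsymbol{y})\, p_y(\boldsymbol{y}) = p_{y|z}(\boldsymbol{y} \mid \boldsymbol{z})\, p_z(\boldsymbol{z}),$$

which immediately yields the result.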
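
The marginalisation, conditioning and Bayes' theorem operations can be traced on a small discrete example, where sums replace the integrals. In the sketch below the joint probabilities and the labels `y0`, `y1`, `z0`, ... are invented for illustration only.

```python
# Hypothetical joint probabilities p(y, z) on a 2 x 3 grid (invented numbers).
joint = {
    ("y0", "z0"): 0.10, ("y0", "z1"): 0.25, ("y0", "z2"): 0.05,
    ("y1", "z0"): 0.30, ("y1", "z1"): 0.10, ("y1", "z2"): 0.20,
}

def marginal(joint, axis):
    """Sum the joint over the other coordinate (axis 0 gives p(y), axis 1 gives p(z))."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

p_y, p_z = marginal(joint, 0), marginal(joint, 1)

# Conditional density p(z | y) = p(y, z) / p(y).
p_z_given_y = {(y, z): p / p_y[y] for (y, z), p in joint.items()}

# Bayes' theorem: p(y | z) = p(z | y) p(y) / p(z).
p_y_given_z = {(y, z): p_z_given_y[(y, z)] * p_y[y] / p_z[z] for (y, z) in joint}

print(round(p_y_given_z[("y1", "z0")], 4))  # 0.30 / 0.40 = 0.75
```

Each column of `p_y_given_z` (fixed `z`) sums to one, which is the discrete counterpart of the normalisation by $p_z(\boldsymbol{z})$.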


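It is often convenient to re-express Bayes' theorem in the simple proportionality
form

$$p_{y|z}(\boldsymbol{y} \mid \boldsymbol{z}) \propto p_{z|y}(\boldsymbol{z} \mid \boldsymbol{y})\, p_y(\boldsymbol{y}),$$

since the right-hand side contains all the information required to reconstruct the
normalising constant,

$$[p_z(\boldsymbol{z})]^{-1} = \left[\int p_{z|y}(\boldsymbol{z} \mid \boldsymbol{y})\, p_y(\boldsymbol{y})\, d\boldsymbol{y}\right]^{-1},$$

should the latter be needed explicitly. In many cases, however, it is not explicitly
required since the "shape" of $p_{y|z}(\boldsymbol{y} \mid \boldsymbol{z})$ is all that one needs to know.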


In fact, this latter observation is often extremely useful for avoiding unnecessary
detail when carrying out manipulations involving Bayes' theorem. More generally,
we note that if a density function $p(\boldsymbol{x})$ can be expressed in the form $c\,q(\boldsymbol{x})$,
where $q$ is a function and $c$ is a constant, not depending on $\boldsymbol{x}$, then

$$\int_{\Re^k} q(\boldsymbol{x})\, d\boldsymbol{x} = c^{-1}, \quad \text{since} \quad \int_{\Re^k} p(\boldsymbol{x})\, d\boldsymbol{x} = P(\Re^k) = 1.$$

Any such $q(\boldsymbol{x})$ will be referred to as a kernel of the density $p(\boldsymbol{x})$. The proportionality
form of Bayes' theorem then makes it clear that, up to the final stage
of calculating the normalising constant, we can always just work with kernels of
densities.

For further technical discussion of the generalised Bayes' theorem, see Mouchart
(1976), Hartigan (1983, Chapter 3) and Wasserman and Kadane (1990).
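
To illustrate working with a kernel, the sketch below recovers the normalising constant $c$ numerically from $q(\theta) = \theta^3(1-\theta)^5$ on $(0, 1)$, a kernel of the kind that arises from a uniform prior and binomial data; this particular kernel, the function name `normalising_constant` and the midpoint rule are assumptions chosen for the example. The exact value is available here because $q$ is the kernel of a beta density.

```python
import math

def normalising_constant(kernel, lo, hi, steps=100_000):
    """Return c = 1 / (integral of kernel over [lo, hi]), using the midpoint rule."""
    h = (hi - lo) / steps
    area = h * sum(kernel(lo + (i + 0.5) * h) for i in range(steps))
    return 1.0 / area

def kernel(t):
    """q(theta) = theta^3 (1 - theta)^5, known only up to proportionality."""
    return t ** 3 * (1 - t) ** 5

c = normalising_constant(kernel, 0.0, 1.0)
# Exact constant for comparison: 1 / B(4, 6) = Gamma(10) / (Gamma(4) Gamma(6)) = 504.
exact = math.gamma(10) / (math.gamma(4) * math.gamma(6))
print(c, exact)
```

In practice the constant is needed only at the final stage, for instance when posterior probabilities rather than relative densities are required.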

As with marginalising, the Bayes' theorem operation of passing from a conditional
and a marginal density to the "other" conditional density is also fundamental
to Bayesian inference and, again, it is useful to have available an alternative notation.
To denote the Bayes' theorem operation, we shall therefore write

$$p_{z|y}(\boldsymbol{z} \mid \boldsymbol{y}) \otimes p_y(\boldsymbol{y}) \equiv p_{y|z}(\boldsymbol{y} \mid \boldsymbol{z}).$$


In more explicit terms, and dropping the subscripts on densities, Bayes' theorem
can be written in the form

$$p(x_1, \ldots, x_t \mid x_{t+1}, \ldots, x_k) = \frac{p(x_{t+1}, \ldots, x_k \mid x_1, \ldots, x_t)\, p(x_1, \ldots, x_t)}{\int p(x_{t+1}, \ldots, x_k \mid x_1, \ldots, x_t)\, p(x_1, \ldots, x_t)\, dx_1 \cdots dx_t}.$$

Manipulations based on this form will underlie the greater part of the ideas
and results to be developed in subsequent chapters.

In particular, extending the use of the terms given in Chapter 2, we shall typically
interpret densities such as $p(x_1, \ldots, x_t)$ as describing beliefs for the random
quantities $x_1, \ldots, x_t$ before (i.e., prior to) observing the random quantities
$x_{t+1}, \ldots, x_k$, and $p(x_1, \ldots, x_t \mid x_{t+1}, \ldots, x_k)$ as describing beliefs after (i.e.,
posterior to) observing $x_{t+1}, \ldots, x_k$.


Often, manipulation is simplified if independence or conditional independence
assumptions can be made. For example, if $x_1, \ldots, x_t$ were independent we would
have

$$p(x_1, \ldots, x_t) = \prod_{i=1}^{t} p(x_i).$$


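If $\boldsymbol{x}$ is a random vector defined on $\{\Omega, \mathcal{F}, P\}$ such that $\boldsymbol{x} : \Omega \to X \subseteq \Re^k$
and if $\boldsymbol{g} : \Re^k \to Y \subseteq \Re^h$ ($h \le k$) is a function such that $(\boldsymbol{g} \circ \boldsymbol{x})^{-1}(B) \in \mathcal{F}$ for
all $B \in \mathcal{B}$, then $\boldsymbol{g} \circ \boldsymbol{x}$ is also a random vector. We shall typically denote $\boldsymbol{g} \circ \boldsymbol{x}$ by
$\boldsymbol{g}(\boldsymbol{x})$, and, whenever we refer to such vector functions of a random vector $\boldsymbol{x}$, it is
to be understood that the composite function is indeed a random vector. Writing
$\boldsymbol{y} = \boldsymbol{g}(\boldsymbol{x})$, the random vector $\boldsymbol{y}$ induces a probability space $\{\Re^h, \mathcal{B}, P_y\}$ where

$$P_y(B) = P_x(\boldsymbol{g}^{-1}(B)) = P((\boldsymbol{g} \circ \boldsymbol{x})^{-1}(B)), \quad B \in \mathcal{B},$$

and distribution and density functions $F_y$, $p_y$ are defined in the obvious way. These
forms are easily related to $F_x$, $p_x$. In particular, if $\boldsymbol{g}$ is a one-to-one differentiable
function with inverse $\boldsymbol{g}^{-1}$, we have, for each $\boldsymbol{y} \in Y$,

$$p_y(\boldsymbol{y}) = p_x(\boldsymbol{g}^{-1}(\boldsymbol{y}))\, |J_{g^{-1}}(\boldsymbol{y})|,$$

where

$$J_{g^{-1}}(\boldsymbol{y}) = \frac{\partial \boldsymbol{g}^{-1}(\boldsymbol{y})}{\partial \boldsymbol{y}}$$

is the Jacobian of the transformation $\boldsymbol{g}^{-1}$, defined by

$$[J_{g^{-1}}(\boldsymbol{y})]_{ij} = \frac{\partial h_i(\boldsymbol{y})}{\partial y_j}, \quad i, j = 1, \ldots, k,$$

where $h_i(\boldsymbol{y}) = [\boldsymbol{g}^{-1}(\boldsymbol{y})]_i$, and $|J_{g^{-1}}(\boldsymbol{y})|$ denotes the absolute value of its determinant.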
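If $h < k$, we are usually able to define an appropriate $\boldsymbol{z}$, with dimension $k - h$,
such that $\boldsymbol{w} = \boldsymbol{f}(\boldsymbol{x}) = (\boldsymbol{y}, \boldsymbol{z}) = (\boldsymbol{g}(\boldsymbol{x}), \boldsymbol{z})$ is a one-to-one function with inverse
$\boldsymbol{f}^{-1}$, and then proceed in two steps to obtain $p_y(\boldsymbol{y})$ by first obtaining

$$p_w(\boldsymbol{w}) = p_w(\boldsymbol{y}, \boldsymbol{z}) = p_x(\boldsymbol{f}^{-1}(\boldsymbol{w}))\, |J_{f^{-1}}(\boldsymbol{w})|,$$

and then marginalising to

$$p_y(\boldsymbol{y}) = \int_{\Re^{k-h}} p_w(\boldsymbol{y}, \boldsymbol{z})\, d\boldsymbol{z}.$$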
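
A one-dimensional instance of the change-of-variable formula may help fix ideas. The sketch below assumes, purely for illustration, that $x$ has a unit-rate exponential density and that $y = g(x) = x^2$, so that $g^{-1}(y) = \sqrt{y}$ and the Jacobian is $1/(2\sqrt{y})$; the result is compared with a numerical derivative of the distribution function of $y$ obtained directly.

```python
import math

def p_x(x):
    """Density of x: unit-rate exponential (an illustrative assumption)."""
    return math.exp(-x) if x > 0 else 0.0

def g_inv(y):
    """Inverse of g(x) = x^2 on x > 0."""
    return math.sqrt(y)

def jac_g_inv(y):
    """Derivative of g_inv; for k = 1 the Jacobian matrix is 1 x 1."""
    return 0.5 / math.sqrt(y)

def p_y(y):
    """p_y(y) = p_x(g^{-1}(y)) |J_{g^{-1}}(y)|."""
    return p_x(g_inv(y)) * abs(jac_g_inv(y))

def F_y(y):
    """Distribution function of y derived directly: P(x^2 <= y) = 1 - exp(-sqrt(y))."""
    return 1.0 - math.exp(-math.sqrt(y))

y0, h = 2.0, 1e-5
numeric_density = (F_y(y0 + h) - F_y(y0 - h)) / (2 * h)  # central difference
print(p_y(y0), numeric_density)
```

The two printed values should agree to many decimal places, since both are evaluations of the same density at $y = 2$.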
The expectation concept generalises to the case of random vectors in an
obvious way.


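Definition 3.10. (Expectation of a random vector). If $\boldsymbol{x}$, $\boldsymbol{y}$ are random vectors
such that $\boldsymbol{y} = \boldsymbol{g}(\boldsymbol{x})$, $\boldsymbol{x} : \Omega \to X \subseteq \Re^k$, $\boldsymbol{g} : \Re^k \to Y \subseteq \Re^h$ ($h \le k$), the
expectation of $\boldsymbol{y}$, $E[\boldsymbol{y}] = E[\boldsymbol{g}(\boldsymbol{x})]$, is a vector whose $i$th component,

$$E[y_i] = E[\boldsymbol{g}(\boldsymbol{x})]_i = E[g_i(\boldsymbol{x})], \quad i = 1, \ldots, h,$$

is defined by either

$$\sum_{\boldsymbol{y} \in Y} y_i\, p_y(\boldsymbol{y}) = \sum_{y_i} y_i\, p_{y_i}(y_i) = \sum_{\boldsymbol{x} \in X} g_i(\boldsymbol{x})\, p_x(\boldsymbol{x})$$

or

$$\int_Y y_i\, p_y(\boldsymbol{y})\, d\boldsymbol{y} = \int y_i\, p_{y_i}(y_i)\, dy_i = \int_X g_i(\boldsymbol{x})\, p_x(\boldsymbol{x})\, d\boldsymbol{x},$$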

for the discrete and absolutely continuous cases, respectively, where all the
equalities are to be interpreted in the sense that if either side exists, so does
the other and they are equal.

In particular, the forms defined by $E\left[\prod_{i=1}^{k} x_i^{n_i}\right]$ are called the moments of $\boldsymbol{x}$ of
order $n = n_1 + \cdots + n_k$. Important special cases include the first-order moments,
$E[x_i]$, $i = 1, \ldots, k$, and the second-order moments, $E[x_i^2]$ ($i = 1, \ldots, k$), $E[x_i x_j]$
($1 \le i \ne j \le k$). If $x_1, \ldots, x_k$ are independent, then $E\left[\prod_{i=1}^{k} x_i^{n_i}\right] = \prod_{i=1}^{k} E[x_i^{n_i}]$.
The covariance between $x_i$ and $x_j$ is defined by

$$C[x_i, x_j] = E[(x_i - E[x_i])(x_j - E[x_j])] = E[x_i x_j] - E[x_i]\,E[x_j],$$

and the correlation by

$$R[x_i, x_j] = \frac{C[x_i, x_j]}{\{V[x_i]\, V[x_j]\}^{1/2}}.$$

The Cauchy-Schwarz inequality establishes that $|R[x_i, x_j]| \le 1$. The expectation
vector with components $E[x_1], \ldots, E[x_k]$ is also called the mean vector,
$E[\boldsymbol{x}]$, of $\boldsymbol{x}$; the $k \times k$ matrix with $(i, j)$th element $C[x_i, x_j]$ is called the covariance
matrix, $V[\boldsymbol{x}]$, of $\boldsymbol{x}$. If the components of $\boldsymbol{x}$ are independent random quantities,
$V[\boldsymbol{x}]$ reduces to a diagonal matrix with $(i, i)$th entry given by $V[x_i]$.
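
The definitions of the mean vector, covariance matrix and correlation can be followed on a small discrete example. In this sketch the joint probability function of $(x_1, x_2)$ is invented, and `expect` is a hypothetical helper that implements the discrete form of the expectation.

```python
import math

# Hypothetical joint probability function p(x1, x2) on {0, 1}^2 (invented values).
pmf = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.6}

def expect(f):
    """E[f(x)] as the sum over x of f(x) p(x)."""
    return sum(f(x) * p for x, p in pmf.items())

# Mean vector E[x].
mean = [expect(lambda x, i=i: x[i]) for i in range(2)]

# Covariance matrix V[x], with (i, j)th element C[x_i, x_j].
cov = [[expect(lambda x, i=i, j=j: (x[i] - mean[i]) * (x[j] - mean[j]))
        for j in range(2)] for i in range(2)]

# Correlation R[x_1, x_2] = C[x_1, x_2] / sqrt(V[x_1] V[x_2]).
corr = cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])
print(mean, cov, corr)
```

Here $E[x_1 x_2] - E[x_1]E[x_2] = 0.6 - 0.49 = 0.11$, which matches the off-diagonal entry computed from the centred form.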

As in the case of a single random quantity, exact forms for moments of an
arbitrary transformation, $\boldsymbol{y} = \boldsymbol{g}(\boldsymbol{x})$, are not available. We shall not need very
general results in this area, but the following will occasionally prove useful.


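Proposition 3.5. (Approximate mean and covariance). If $\boldsymbol{x}$ is a random
vector in $\Re^k$, with $E[\boldsymbol{x}] = \boldsymbol{\mu}$, $V[\boldsymbol{x}] = \boldsymbol{\Sigma}$ and $\boldsymbol{y} = \boldsymbol{g}(\boldsymbol{x})$ is a one-to-one
transformation of $\boldsymbol{x}$ such that $\boldsymbol{g}^{-1}$ exists, then, subject to conditions on the
distribution of $\boldsymbol{x}$ and on the smoothness of $\boldsymbol{g}$,

$$E[(\boldsymbol{g}(\boldsymbol{x}))_i] = E[g_i(\boldsymbol{x})] \approx g_i(\boldsymbol{\mu}) + \tfrac{1}{2}\, \mathrm{tr}\left[\boldsymbol{\Sigma}\, \nabla^2 g_i(\boldsymbol{\mu})\right],$$

$$V[\boldsymbol{g}(\boldsymbol{x})] \approx J_g(\boldsymbol{\mu})\, \boldsymbol{\Sigma}\, J_g(\boldsymbol{\mu})^t,$$

where, for $i = 1, \ldots, k$,

$$\left(\nabla^2 g_i(\boldsymbol{\mu})\right)_{jl} = \left.\frac{\partial^2 g_i(\boldsymbol{x})}{\partial x_j\, \partial x_l}\right|_{\boldsymbol{x} = \boldsymbol{\mu}}, \qquad \left(J_g(\boldsymbol{\mu})\right)_{ij} = \left.\frac{\partial g_i(\boldsymbol{x})}{\partial x_j}\right|_{\boldsymbol{x} = \boldsymbol{\mu}},$$

and where tr[.] denotes the trace of a matrix argument.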

Proof. This follows straightforwardly from a multivariate Taylor expansion:
the details are tedious and we will omit them here.

A note on measure theory. Readers familiar with measure theory will be aware
that there are many subtle steps in passing to a density representation of the 
probability measure Py. In particular, a detailed rigorous treatment of densities 
(Radon-Nikodym derivatives) requires statements about dominating measures 
and comments on the versions assumed for such densities. Readers unfamiliar 
with measure theory will already have assumed —correctly! — that we shall almost 
always be dealing with the “standard” versions of probability mass functions and 
densities (corresponding to counting and Lebesgue dominating measures and 
“smoothly” defined). Only occasionally, in Chapter 4, do we refer to general, 
i.e., non-density, forms. 
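
Returning to Proposition 3.5, the quality of the approximation can be examined in a case where exact moments are known. The sketch below takes $k = 1$ and $g(x) = e^x$, and assumes in addition that $x$ is normal with mean $\mu$ and variance $\sigma^2$ (an assumption not required by the proposition) so that the exact lognormal mean and variance are available for comparison; `approx_moments` is a hypothetical helper name.

```python
import math

def approx_moments(g, dg, d2g, mu, var):
    """Approximate mean and variance of y = g(x) for scalar x (k = 1).

    mean     ~ g(mu) + (1/2) * var * g''(mu)   (the trace term reduces to a product)
    variance ~ g'(mu) * var * g'(mu)           (J Sigma J^t with 1 x 1 matrices)
    """
    return g(mu) + 0.5 * var * d2g(mu), dg(mu) * var * dg(mu)

mu, sigma = 1.0, 0.1
# For g = exp, the function and its first two derivatives coincide.
approx_mean, approx_var = approx_moments(math.exp, math.exp, math.exp, mu, sigma ** 2)

# Exact lognormal moments under the normality assumption made for this example.
exact_mean = math.exp(mu + sigma ** 2 / 2)
exact_var = (math.exp(sigma ** 2) - 1) * math.exp(2 * mu + sigma ** 2)

print(approx_mean, exact_mean)
print(approx_var, exact_var)
```

With $\sigma = 0.1$ the mean approximation is very close while the variance approximation is off by a percent or two; both deteriorate as $\sigma$ grows, which is one way to see why conditions on the distribution of $\boldsymbol{x}$ are needed.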


3.2.5 Some Particular Multivariate Distributions 


We conclude our review of probability theory with a selection of the more frequently 
used multivariate probability distributions; that is to say, distributions for random 
vectors. As in Section 3.2.3, no very detailed discussion will be given: see, for 
example Wilks (1962), Johnson and Kotz (1969, 1972) and DeGroot (1970) for 
further information. 


The Multinomial Distribution 


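A discrete random vector $\boldsymbol{x} = (x_1, \ldots, x_k)$ has a multinomial distribution of
dimension $k$, with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)$ and $n$ ($0 < \theta_i < 1$, $\sum_{i=1}^{k} \theta_i < 1$,
$n = 1, 2, \ldots$) if its probability function $\mathrm{Mu}_k(\boldsymbol{x} \mid \boldsymbol{\theta}, n)$, for $x_i = 0, 1, 2, \ldots$, with
$\sum_{i=1}^{k} x_i \le n$, is

$$\mathrm{Mu}_k(\boldsymbol{x} \mid \boldsymbol{\theta}, n) = \frac{n!}{\prod_{i=1}^{k} x_i!\, \left(n - \sum_{i=1}^{k} x_i\right)!}\; \prod_{i=1}^{k} \theta_i^{x_i} \left(1 - \sum_{i=1}^{k} \theta_i\right)^{n - \sum_{i=1}^{k} x_i}.$$

The mean vector and covariance matrix are given by

$$E[x_i] = n\theta_i, \qquad V[x_i] = n\theta_i(1 - \theta_i), \qquad C[x_i, x_j] = -n\theta_i\theta_j.$$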
The mode(s) of the distribution is (are) located near $E[\boldsymbol{x}]$, satisfying

$$n\theta_i < M[x_i] \le (n + k - 1)\theta_i, \quad i = 1, \ldots, k;$$


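these inequalities, with the condition $\sum_{i=1}^{k} x_i \le n$, restrict the possible modes to a
relatively few points.

The marginal distribution of $\boldsymbol{x}^{(m)} = (x_1, \ldots, x_m)$, $m < k$, is the multinomial
$\mathrm{Mu}_m(\boldsymbol{x}^{(m)} \mid \theta_1, \ldots, \theta_m, n)$. The conditional distribution of $\boldsymbol{x}^{(m)}$ given the remaining
$x_i$'s is also multinomial, and it depends on the remaining $x_i$'s only through their
sum $s = \sum_{i=m+1}^{k} x_i$; specifically,

$$p(\boldsymbol{x}^{(m)} \mid x_{m+1}, \ldots, x_k) = \mathrm{Mu}_m\!\left(\boldsymbol{x}^{(m)} \,\Big|\, \frac{\theta_1}{1 - \sum_{j=m+1}^{k} \theta_j}, \ldots, \frac{\theta_m}{1 - \sum_{j=m+1}^{k} \theta_j},\; n - s\right).$$

If $\boldsymbol{x} = (x_1, \ldots, x_k)$ has density $\mathrm{Mu}_k(\boldsymbol{x} \mid \boldsymbol{\theta}, n)$ then $\boldsymbol{y} = (y_1, \ldots, y_t)$ where

$$y_1 = x_1 + \cdots + x_{i_1}, \quad \ldots, \quad y_t = x_{i_{t-1}+1} + \cdots + x_k, \qquad 1 \le t < k,$$

has density $\mathrm{Mu}_t(\boldsymbol{y} \mid \boldsymbol{\phi}, n)$, where

$$\phi_1 = \theta_1 + \cdots + \theta_{i_1}, \quad \ldots, \quad \phi_t = \theta_{i_{t-1}+1} + \cdots + \theta_k.$$

If $\boldsymbol{z}$ is the sum of $m$ independent random vectors having multinomial densities
with parameters $(\boldsymbol{\theta}, n_i)$, $i = 1, \ldots, m$, then $\boldsymbol{z}$ also has a multinomial density with
parameters $\boldsymbol{\theta}$ and $(n_1 + \cdots + n_m)$. If $k = 1$, $\mathrm{Mu}_k(\boldsymbol{x} \mid \boldsymbol{\theta}, n)$ reduces to the binomial
density $\mathrm{Bi}(x \mid \theta, n)$.

If $x_1, \ldots, x_k$ are $k$ independent Poisson random quantities with densities
$\mathrm{Pn}(x_i \mid \lambda_i)$, then the joint distribution of $\boldsymbol{x} = (x_1, \ldots, x_{k-1})$ given $n = \sum_{i=1}^{k} x_i$
is multinomial $\mathrm{Mu}_{k-1}(\boldsymbol{x} \mid \boldsymbol{\theta}, n)$, with $\theta_i = \lambda_i / \sum_{j=1}^{k} \lambda_j$.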
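
The parametrisation used here, with $k$ free counts and an implicit $(k+1)$th cell holding the remainder, can be checked by direct enumeration. The following sketch implements the probability function for small cases and verifies the stated mean and covariance numerically; `mu_pmf` and the parameter values are illustrative choices.

```python
from math import factorial

def mu_pmf(x, theta, n):
    """Mu_k(x | theta, n): k counts in x; the (k+1)th cell receives n - sum(x)."""
    rest = n - sum(x)
    if rest < 0 or any(xi < 0 for xi in x):
        return 0.0
    coef = factorial(n)
    for xi in list(x) + [rest]:
        coef //= factorial(xi)          # exact integer multinomial coefficient
    prob = (1 - sum(theta)) ** rest
    for xi, ti in zip(x, theta):
        prob *= ti ** xi
    return coef * prob

theta, n = (0.2, 0.5), 6                # k = 2, illustrative values
support = [(a, b) for a in range(n + 1) for b in range(n + 1 - a)]
total = sum(mu_pmf(x, theta, n) for x in support)
mean1 = sum(x[0] * mu_pmf(x, theta, n) for x in support)
mean2 = sum(x[1] * mu_pmf(x, theta, n) for x in support)
cov12 = sum(x[0] * x[1] * mu_pmf(x, theta, n) for x in support) - mean1 * mean2
print(total, mean1, cov12)              # compare with 1, n*theta_1, -n*theta_1*theta_2
```

Enumeration of the full support is feasible only for small $n$ and $k$; it is used here simply because it makes the check exact up to rounding.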


The Dirichlet Distribution 


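A continuous random vector $\boldsymbol{x} = (x_1, \ldots, x_k)$ has a Dirichlet distribution of
dimension $k$, with parameters $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{k+1})$ ($\alpha_i > 0$, $i = 1, \ldots, k+1$) if
its probability density $\mathrm{Di}_k(\boldsymbol{x} \mid \boldsymbol{\alpha})$, $0 < x_i < 1$ and $x_1 + \cdots + x_k < 1$, is

$$\mathrm{Di}_k(\boldsymbol{x} \mid \boldsymbol{\alpha}) = c\, x_1^{\alpha_1 - 1} \cdots x_k^{\alpha_k - 1} \left(1 - \sum_{i=1}^{k} x_i\right)^{\alpha_{k+1} - 1},$$

where

$$c = \frac{\Gamma\left(\sum_{i=1}^{k+1} \alpha_i\right)}{\prod_{i=1}^{k+1} \Gamma(\alpha_i)}.$$

If $k = 1$, $\mathrm{Di}_k(\boldsymbol{x} \mid \boldsymbol{\alpha})$ reduces to the beta density $\mathrm{Be}(x \mid \alpha_1, \alpha_2)$. In the general case,
the mean vector and covariance matrix are given by

$$E[x_i] = \frac{\alpha_i}{\sum_{j=1}^{k+1} \alpha_j}, \qquad V[x_i] = \frac{E[x_i](1 - E[x_i])}{1 + \sum_{j=1}^{k+1} \alpha_j}, \qquad C[x_i, x_j] = \frac{-E[x_i]\, E[x_j]}{1 + \sum_{j=1}^{k+1} \alpha_j}.$$

If $\alpha_i > 1$, $i = 1, \ldots, k$, there is a mode given by

$$M[x_i] = \frac{\alpha_i - 1}{\sum_{j=1}^{k+1} \alpha_j - k - 1}.$$

The marginal distribution of $\boldsymbol{x}^{(m)} = (x_1, \ldots, x_m)$, $m < k$, is the Dirichlet

$$p(\boldsymbol{x}^{(m)}) = \mathrm{Di}_m\!\left(\boldsymbol{x}^{(m)} \,\Big|\, \alpha_1, \ldots, \alpha_m, \sum_{j=m+1}^{k+1} \alpha_j\right).$$

The conditional distribution, given $x_{m+1}, \ldots, x_k$, of

$$x_i' = \frac{x_i}{1 - \sum_{j=m+1}^{k} x_j}, \quad i = 1, \ldots, m,$$

is also Dirichlet, $\mathrm{Di}_m(x_1', \ldots, x_m' \mid \alpha_1, \ldots, \alpha_m, \alpha_{k+1})$. In particular,

$$p(x_i' \mid x_{m+1}, \ldots, x_k) = \mathrm{Be}\!\left(x_i' \,\Big|\, \alpha_i,\; \sum_{j=1}^{m} \alpha_j + \alpha_{k+1} - \alpha_i\right), \quad i = 1, \ldots, m.$$

Moreover, if $\boldsymbol{x} = (x_1, \ldots, x_k)$ has density $\mathrm{Di}_k(\boldsymbol{x} \mid \boldsymbol{\alpha})$, then $\boldsymbol{y} = (y_1, \ldots, y_t)$ where

$$y_1 = x_1 + \cdots + x_{i_1}, \quad \ldots, \quad y_t = x_{i_{t-1}+1} + \cdots + x_k, \qquad t \le k,$$

has density $\mathrm{Di}_t(\boldsymbol{y} \mid \boldsymbol{\beta})$, where

$$\beta_1 = \alpha_1 + \cdots + \alpha_{i_1}, \quad \ldots, \quad \beta_t = \alpha_{i_{t-1}+1} + \cdots + \alpha_k, \quad \beta_{t+1} = \alpha_{k+1}.$$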
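
The Dirichlet moments can be checked by simulation. The sketch below draws from $\mathrm{Di}_2(\boldsymbol{x} \mid \boldsymbol{\alpha})$ using the standard normalised-gamma construction, which is not described in the text and is used here only as a convenient sampling device; the parameter values, seed and helper name `dirichlet_draw` are likewise assumptions of the example.

```python
import random

def dirichlet_draw(rng, alpha):
    """One draw (x_1, ..., x_k) from Di_k(x | alpha), where alpha has k + 1 entries.

    Uses independent Gamma(alpha_i, 1) variates divided by their sum and
    drops the last coordinate, which is determined by the others.
    """
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [gi / s for gi in g[:-1]]

rng = random.Random(2)
alpha = (2.0, 3.0, 5.0)                  # k = 2, so alpha has three entries
draws = [dirichlet_draw(rng, alpha) for _ in range(20_000)]

mean_x1 = sum(d[0] for d in draws) / len(draws)
var_x1 = sum((d[0] - mean_x1) ** 2 for d in draws) / len(draws)

e1 = alpha[0] / sum(alpha)               # E[x_1] = 0.2
v1 = e1 * (1 - e1) / (1 + sum(alpha))    # V[x_1] = 0.16 / 11
print(mean_x1, e1)
print(var_x1, v1)
```

The simulated values agree with the formulae only up to Monte Carlo error, which decreases at the usual $1/\sqrt{N}$ rate in the number of draws.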


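The Multinomial-Dirichlet Distribution


A discrete random vector $\boldsymbol{x} = (x_1, \ldots, x_k)$ has a multinomial-Dirichlet distribution
of dimension $k$, with parameters $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{k+1})$ and $n$ where $\alpha_i > 0$, and
$n = 1, 2, \ldots$, if its probability function $\mathrm{Md}_k(\boldsymbol{x} \mid \boldsymbol{\alpha}, n)$, for $x_i = 0, 1, 2, \ldots$, with
$\sum_{i=1}^{k} x_i \le n$, is

$$\mathrm{Md}_k(\boldsymbol{x} \mid \boldsymbol{\alpha}, n) = c \prod_{j=1}^{k+1} \frac{\alpha_j^{[x_j]}}{x_j!},$$

where $\alpha^{[s]} = \prod_{j=1}^{s} (\alpha + j - 1)$ defines the ascending factorial function, with
$x_{k+1} = n - \sum_{j=1}^{k} x_j$, and

$$c = \frac{n!}{\left(\sum_{j=1}^{k+1} \alpha_j\right)^{[n]}}.$$

The mean vector and covariance matrix are given by

$$E[x_i] = n p_i, \qquad p_i = \frac{\alpha_i}{\sum_{j=1}^{k+1} \alpha_j},$$

$$V[x_i] = \frac{n + \sum_{j=1}^{k+1} \alpha_j}{1 + \sum_{j=1}^{k+1} \alpha_j}\; n p_i (1 - p_i), \qquad C[x_i, x_j] = -\frac{n + \sum_{l=1}^{k+1} \alpha_l}{1 + \sum_{l=1}^{k+1} \alpha_l}\; n p_i p_j.$$

The marginal distribution of the subset $\{x_1, \ldots, x_m\}$ is a multinomial-Dirichlet
with parameters $\{\alpha_1, \ldots, \alpha_m, \sum_{j=m+1}^{k+1} \alpha_j\}$ and $n$. In particular, the marginal
distribution of $x_i$ is the binomial-beta $\mathrm{Bb}(x_i \mid \alpha_i, \sum_{j=1}^{k+1} \alpha_j - \alpha_i, n)$. Moreover,
the conditional distribution of $\{x_{m+1}, \ldots, x_k\}$ given $\{x_1, \ldots, x_m\}$ is also
multinomial-Dirichlet, with parameters $\{\alpha_{m+1}, \ldots, \alpha_{k+1}\}$ and
$n - \sum_{j=1}^{m} x_j$. For an interesting characterization of this distribution, see Basu and
Pereira (1983).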
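
The multinomial-Dirichlet probability function and its moments can be verified by enumeration in small cases. In this sketch `ascending` implements the ascending factorial $\alpha^{[s]}$ and `md_pmf` the probability function; the parameter values are arbitrary illustrative choices.

```python
from math import factorial

def ascending(a, s):
    """Ascending factorial a^[s] = a (a + 1) ... (a + s - 1); equals 1 when s = 0."""
    out = 1.0
    for j in range(s):
        out *= a + j
    return out

def md_pmf(x, alpha, n):
    """Md_k(x | alpha, n) with x of length k and alpha of length k + 1."""
    rest = n - sum(x)                    # x_{k+1} = n - sum of the k counts
    if rest < 0:
        return 0.0
    prob = factorial(n) / ascending(sum(alpha), n)   # the constant c
    for xj, aj in zip(list(x) + [rest], alpha):
        prob *= ascending(aj, xj) / factorial(xj)
    return prob

alpha, n = (1.0, 2.0, 3.0), 4            # k = 2, illustrative values
support = [(a, b) for a in range(n + 1) for b in range(n + 1 - a)]
total = sum(md_pmf(x, alpha, n) for x in support)
mean1 = sum(x[0] * md_pmf(x, alpha, n) for x in support)
var1 = sum(x[0] ** 2 * md_pmf(x, alpha, n) for x in support) - mean1 ** 2

A = sum(alpha)
p1 = alpha[0] / A
print(total)                                          # should be 1
print(mean1, n * p1)                                  # E[x_1] = n p_1
print(var1, (n + A) / (1 + A) * n * p1 * (1 - p1))    # V[x_1]
```

The factor $(n + \sum_j \alpha_j)/(1 + \sum_j \alpha_j)$ exceeds one whenever $n > 1$, so the variance is larger than that of a multinomial with the same mean.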


The Normal-Gamma Distribution 


A continuous bivariate random vector $(x, y)$ has a normal-gamma distribution,
with parameters $\mu$, $\lambda$, $\alpha$ and $\beta$ ($\mu \in \Re$, $\lambda > 0$, $\alpha > 0$, $\beta > 0$) if its density
$\mathrm{Ng}(x, y \mid \mu, \lambda, \alpha, \beta)$ is

$$\mathrm{Ng}(x, y \mid \mu, \lambda, \alpha, \beta) = \mathrm{N}(x \mid \mu, \lambda y)\, \mathrm{Ga}(y \mid \alpha, \beta), \quad x \in \Re,\ y > 0,$$

where the normal and gamma densities are defined in Section 3.2.2. It is clear
from the definition that the conditional density of $x$ given $y$ is $\mathrm{N}(x \mid \mu, \lambda y)$ and
that the marginal density of $y$ is $\mathrm{Ga}(y \mid \alpha, \beta)$. Moreover, the marginal density of $x$
is $\mathrm{St}(x \mid \mu, \lambda\alpha/\beta, 2\alpha)$.

The shape of a normal-gamma distribution is illustrated in Figure 3.1, where
the probability density of $\mathrm{Ng}(x, y \mid 0, 1, 5, 5)$ is displayed both as a surface and in
terms of equal density contours.
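
The statement that the marginal density of $x$ is Student can be checked numerically by integrating $y$ out of the joint density. The sketch below assumes the usual forms for the densities of Section 3.2.2, namely a normal parametrised by its precision, a gamma with shape $\alpha$ and rate $\beta$, and a Student density with location, precision and degrees of freedom; the function names, integration range and step count are choices made for this example.

```python
import math

def normal(x, mu, lam):
    """N(x | mu, lam), with lam the precision (assumed parametrisation)."""
    return math.sqrt(lam / (2 * math.pi)) * math.exp(-0.5 * lam * (x - mu) ** 2)

def gamma_density(y, a, b):
    """Ga(y | a, b) with shape a and rate b (assumed parametrisation)."""
    return b ** a / math.gamma(a) * y ** (a - 1) * math.exp(-b * y)

def student(x, mu, lam, a):
    """St(x | mu, lam, a): location mu, precision lam, a degrees of freedom."""
    c = math.gamma((a + 1) / 2) / math.gamma(a / 2) * math.sqrt(lam / (a * math.pi))
    return c * (1 + lam * (x - mu) ** 2 / a) ** (-(a + 1) / 2)

def marginal_x(x, mu, lam, a, b, upper=30.0, steps=60_000):
    """Integrate Ng(x, y | mu, lam, a, b) over y in (0, upper) by the midpoint rule."""
    h = upper / steps
    ys = ((i + 0.5) * h for i in range(steps))
    return h * sum(normal(x, mu, lam * y) * gamma_density(y, a, b) for y in ys)

mu, lam, a, b = 0.0, 1.0, 5.0, 5.0       # the parameter values of Figure 3.1
x0 = 0.7
lhs = marginal_x(x0, mu, lam, a, b)
rhs = student(x0, mu, lam * a / b, 2 * a)
print(lhs, rhs)
```

Truncating the integral at `upper = 30` is harmless here because the gamma factor is negligible far beyond its mean; other parameter values may need a different range.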


The Multivariate Normal Distribution 


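A continuous random vector $\boldsymbol{x} = (x_1, \ldots, x_k)$ has a multivariate normal distribution
of dimension $k$, with parameters $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_k)$ and $\boldsymbol{\lambda}$, where $\boldsymbol{\mu} \in \Re^k$ and $\boldsymbol{\lambda}$
is a $k \times k$ symmetric positive-definite matrix, if its probability density $\mathrm{N}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\lambda})$
is

$$\mathrm{N}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\lambda}) = c\, \exp\left\{-\tfrac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^t \boldsymbol{\lambda} (\boldsymbol{x} - \boldsymbol{\mu})\right\}, \quad \boldsymbol{x} \in \Re^k,$$

where $c = (2\pi)^{-k/2} |\boldsymbol{\lambda}|^{1/2}$.
If $k = 1$, so that $\boldsymbol{\lambda}$ is a scalar, $\lambda$, $\mathrm{N}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\lambda})$ reduces to the univariate normal
density $\mathrm{N}(x \mid \mu, \lambda)$.

Figure 3.1. The normal-gamma density Ng(x, y | 0, 1, 5, 5)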


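In the general case, $E[x_i] = \mu_i$, and, with $\boldsymbol{\Sigma} = \boldsymbol{\lambda}^{-1}$ of general element $\sigma_{ij}$,
$V[x_i] = \sigma_{ii}$ and $C[x_i, x_j] = \sigma_{ij}$, so that $V[\boldsymbol{x}] = \boldsymbol{\lambda}^{-1}$. The parameter $\boldsymbol{\mu}$ therefore
labels the mean vector and the parameter $\boldsymbol{\lambda}$ the precision matrix (the inverse of the
covariance matrix, $\boldsymbol{\Sigma}$). If $\boldsymbol{y} = A\boldsymbol{x}$, where $A$ is an $m \times k$ matrix of real numbers
such that $A\boldsymbol{\Sigma}A^t$ is non-singular, then $\boldsymbol{y}$ has density $\mathrm{N}_m(\boldsymbol{y} \mid A\boldsymbol{\mu}, (A\boldsymbol{\Sigma}A^t)^{-1})$.

In particular, the marginal density for any subvector of $\boldsymbol{x}$ is (multivariate)
normal, of appropriate dimension, with mean vector and covariance matrix given
by the corresponding subvector of $\boldsymbol{\mu}$ and submatrix of $\boldsymbol{\lambda}^{-1}$.

Moreover, if $\boldsymbol{x} = (\boldsymbol{x}_1, \boldsymbol{x}_2)$ is a partition of $\boldsymbol{x}$, with $\boldsymbol{x}_i$ having dimension $k_i$,
and $k_1 + k_2 = k$, and if the corresponding partitions of $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$ are

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \qquad \boldsymbol{\lambda} = \begin{pmatrix} \boldsymbol{\lambda}_{11} & \boldsymbol{\lambda}_{12} \\ \boldsymbol{\lambda}_{21} & \boldsymbol{\lambda}_{22} \end{pmatrix},$$

then the conditional density of $\boldsymbol{x}_1$ given $\boldsymbol{x}_2$ is also (multivariate) normal, of dimension
$k_1$, with mean vector and precision matrix given, respectively, by

$$\boldsymbol{\mu}_1 - \boldsymbol{\lambda}_{11}^{-1} \boldsymbol{\lambda}_{12} (\boldsymbol{x}_2 - \boldsymbol{\mu}_2) \quad \text{and} \quad \boldsymbol{\lambda}_{11}.$$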
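The random quantity $y = (\boldsymbol{x} - \boldsymbol{\mu})^t \boldsymbol{\lambda} (\boldsymbol{x} - \boldsymbol{\mu})$ has a $\chi^2(y \mid k)$ density.
We also note that, from the form of the multivariate normal density, we can
deduce the integral formula

$$\int_{\Re^k} \exp\left\{-\tfrac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^t \boldsymbol{\lambda} (\boldsymbol{x} - \boldsymbol{\mu})\right\} d\boldsymbol{x} = \frac{(2\pi)^{k/2}}{|\boldsymbol{\lambda}|^{1/2}}.$$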
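
Because the multivariate normal is parametrised here by its precision matrix, the conditional mean and precision take a simpler form than in the covariance parametrisation. The sketch below compares the two forms for $k = 2$ with $k_1 = k_2 = 1$; the numerical values of $\boldsymbol{\mu}$, $\boldsymbol{\lambda}$ and $x_2$ are invented, and `inverse_2x2` is a small helper written for this example.

```python
def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

mu = [1.0, -2.0]
lam = [[2.0, 0.6], [0.6, 1.0]]     # hypothetical symmetric positive-definite precision
x2 = 0.5                           # observed value of the second component

# Precision form: mean mu_1 - lam_11^{-1} lam_12 (x_2 - mu_2), precision lam_11.
cond_mean_prec = mu[0] - lam[0][1] / lam[0][0] * (x2 - mu[1])
cond_var_prec = 1.0 / lam[0][0]

# Covariance form, with Sigma = lam^{-1}:
# mean mu_1 + Sigma_12 Sigma_22^{-1} (x_2 - mu_2), variance Sigma_11 - Sigma_12^2 / Sigma_22.
sig = inverse_2x2(lam)
cond_mean_cov = mu[0] + sig[0][1] / sig[1][1] * (x2 - mu[1])
cond_var_cov = sig[0][0] - sig[0][1] ** 2 / sig[1][1]

print(cond_mean_prec, cond_mean_cov)   # both equal 0.25
print(cond_var_prec, cond_var_cov)     # both equal 0.5
```

The precision form needs no matrix inversion beyond that of the block $\boldsymbol{\lambda}_{11}$, which is one practical reason for preferring this parametrisation when conditioning.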


The Wishart Distribution 


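A symmetric, positive-definite matrix $\boldsymbol{x}$ of random quantities $x_{ij} = x_{ji}$, for
$i = 1, \ldots, k$, $j = 1, \ldots, k$, has a Wishart distribution of dimension $k$, with parameters
$\alpha$ and $\boldsymbol{\beta}$ (with $2\alpha > k - 1$ and $\boldsymbol{\beta}$ a $k \times k$ symmetric, nonsingular matrix), if the
density $\mathrm{Wi}_k(\boldsymbol{x} \mid \alpha, \boldsymbol{\beta})$ of the $k(k + 1)/2$ dimensional random vector of the distinct
entries of $\boldsymbol{x}$ is

$$\mathrm{Wi}_k(\boldsymbol{x} \mid \alpha, \boldsymbol{\beta}) = c\, |\boldsymbol{x}|^{\alpha - (k+1)/2} \exp\{-\mathrm{tr}(\boldsymbol{\beta}\boldsymbol{x})\},$$

where $c = |\boldsymbol{\beta}|^{\alpha} / \Gamma_k(\alpha)$,

$$\Gamma_k(\alpha) = \pi^{k(k-1)/4} \prod_{i=1}^{k} \Gamma\!\left(\frac{2\alpha + 1 - i}{2}\right)$$

is the generalised gamma function and tr(.), as before, denotes the trace of a matrix
argument. If $k = 1$, so that $\boldsymbol{\beta}$ is a scalar $\beta$, then $\mathrm{Wi}_k(\boldsymbol{x} \mid \alpha, \boldsymbol{\beta})$ reduces to the gamma
density $\mathrm{Ga}(x \mid \alpha, \beta)$.

If $\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\}$ is a random sample of size $n > 1$ from a multivariate normal
$\mathrm{N}_k(\boldsymbol{x}_i \mid \boldsymbol{\mu}, \boldsymbol{\lambda})$, and $\bar{\boldsymbol{x}} = n^{-1} \sum_{i=1}^{n} \boldsymbol{x}_i$, then $\bar{\boldsymbol{x}}$ is $\mathrm{N}_k(\bar{\boldsymbol{x}} \mid \boldsymbol{\mu}, n\boldsymbol{\lambda})$, and

$$\boldsymbol{S} = \sum_{i=1}^{n} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^t$$

is independent of $\bar{\boldsymbol{x}}$, and has a Wishart distribution $\mathrm{Wi}_k(\boldsymbol{S} \mid \tfrac{1}{2}(n - 1), \tfrac{1}{2}\boldsymbol{\lambda})$.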

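The following properties of the Wishart distribution are easily established:
$E[\boldsymbol{x}] = \alpha\boldsymbol{\beta}^{-1}$ and $E[\boldsymbol{x}^{-1}] = (\alpha - (k + 1)/2)^{-1}\boldsymbol{\beta}$; if $\boldsymbol{y} = A\boldsymbol{x}A^t$ where $A$ is
an $m \times k$ matrix ($m \le k$) of real numbers, then $\boldsymbol{y}$ has a Wishart distribution of
dimension $m$ with parameters $\alpha$ and $(A\boldsymbol{\beta}^{-1}A^t)^{-1}$, if the latter exists; in particular,
if $\boldsymbol{x}$ and $\boldsymbol{\beta}^{-1}$ conformably partition into

$$\boldsymbol{x} = \begin{pmatrix} \boldsymbol{x}_{11} & \boldsymbol{x}_{12} \\ \boldsymbol{x}_{21} & \boldsymbol{x}_{22} \end{pmatrix}, \qquad \boldsymbol{\beta}^{-1} = \begin{pmatrix} \boldsymbol{\sigma}_{11} & \boldsymbol{\sigma}_{12} \\ \boldsymbol{\sigma}_{21} & \boldsymbol{\sigma}_{22} \end{pmatrix},$$

where $\boldsymbol{x}_{11}$, $\boldsymbol{\sigma}_{11}$ are square $h \times h$ matrices ($1 \le h < k$), then $\boldsymbol{x}_{11}$ has a Wishart
distribution of dimension $h$ with parameters $\alpha$ and $(\boldsymbol{\sigma}_{11})^{-1}$. Moreover, if $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_s$
are independent $k \times k$ random matrices, each with a Wishart distribution, with
parameters $\alpha_i$, $\boldsymbol{\beta}$, $i = 1, \ldots, s$, then $\boldsymbol{x}_1 + \cdots + \boldsymbol{x}_s$ also has a Wishart distribution,
with parameters $\alpha_1 + \cdots + \alpha_s$ and $\boldsymbol{\beta}$.

We note that, from the form of the Wishart density, we can deduce the integral
formula

$$\int |\boldsymbol{x}|^{\alpha - (k+1)/2} \exp\{-\mathrm{tr}(\boldsymbol{\beta}\boldsymbol{x})\}\, d\boldsymbol{x} = \frac{\Gamma_k(\alpha)}{|\boldsymbol{\beta}|^{\alpha}},$$

the integration being understood to be with respect to the $k(k + 1)/2$ distinct
elements of the matrix $\boldsymbol{x}$.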


The Multivariate Student Distribution 


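A continuous random vector $\boldsymbol{x} = (x_1, \ldots, x_k)$ has a multivariate Student distribution
of dimension $k$, with parameters $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_k)$, $\boldsymbol{\lambda}$ and $\alpha$ ($\boldsymbol{\mu} \in \Re^k$,
$\boldsymbol{\lambda}$ a symmetric, positive-definite $k \times k$ matrix, $\alpha > 0$) if its probability density
$\mathrm{St}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\lambda}, \alpha)$ is

$$\mathrm{St}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\lambda}, \alpha) = c \left[1 + \frac{1}{\alpha}(\boldsymbol{x} - \boldsymbol{\mu})^t \boldsymbol{\lambda} (\boldsymbol{x} - \boldsymbol{\mu})\right]^{-(\alpha + k)/2}, \quad \boldsymbol{x} \in \Re^k,$$

where

$$c = \frac{\Gamma((\alpha + k)/2)}{\Gamma(\alpha/2)\, (\alpha\pi)^{k/2}}\, |\boldsymbol{\lambda}|^{1/2}.$$

If $k = 1$, so that $\boldsymbol{\lambda}$ is a scalar, $\lambda$, then $\mathrm{St}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\lambda}, \alpha)$ reduces to the univariate
Student density $\mathrm{St}(x \mid \mu, \lambda, \alpha)$. In the general case, $E[\boldsymbol{x}] = \boldsymbol{\mu}$ and $V[\boldsymbol{x}] =
\boldsymbol{\lambda}^{-1}(\alpha/(\alpha - 2))$. Although not exactly equal to the inverse of the covariance matrix,
the parameter $\boldsymbol{\lambda}$ is often referred to as the precision matrix of the distribution.

If $\boldsymbol{y} = A\boldsymbol{x}$, where $A$ is an $m \times k$ matrix ($m \le k$) of real numbers such that
$A\boldsymbol{\lambda}^{-1}A^t$ is non-singular, then $\boldsymbol{y}$ has density $\mathrm{St}_m(\boldsymbol{y} \mid A\boldsymbol{\mu}, (A\boldsymbol{\lambda}^{-1}A^t)^{-1}, \alpha)$. In
particular, the marginal density for any subvector of $\boldsymbol{x}$ is (multivariate) Student,
of appropriate dimension, with mean vector and inverse of the precision matrix
given by the corresponding subvector of $\boldsymbol{\mu}$ and submatrix of $\boldsymbol{\lambda}^{-1}$. Moreover, if
$\boldsymbol{x} = (\boldsymbol{x}_1, \boldsymbol{x}_2)$ is a partition of $\boldsymbol{x}$ and the corresponding partitions of $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$ are
given by

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \qquad \boldsymbol{\lambda} = \begin{pmatrix} \boldsymbol{\lambda}_{11} & \boldsymbol{\lambda}_{12} \\ \boldsymbol{\lambda}_{21} & \boldsymbol{\lambda}_{22} \end{pmatrix},$$

then the conditional density of $\boldsymbol{x}_1$, given $\boldsymbol{x}_2$, is also (multivariate) Student, of
dimension $k_1$, with $\alpha + k_2$ degrees of freedom, and mean vector and precision
matrix, respectively, given by

$$\boldsymbol{\mu}_1 - \boldsymbol{\lambda}_{11}^{-1} \boldsymbol{\lambda}_{12} (\boldsymbol{x}_2 - \boldsymbol{\mu}_2),$$

$$\boldsymbol{\lambda}_{11}\, \frac{\alpha + k_2}{\alpha + (\boldsymbol{x}_2 - \boldsymbol{\mu}_2)^t (\boldsymbol{\lambda}_{22} - \boldsymbol{\lambda}_{21} \boldsymbol{\lambda}_{11}^{-1} \boldsymbol{\lambda}_{12}) (\boldsymbol{x}_2 - \boldsymbol{\mu}_2)}.$$

The random quantity $y = (\boldsymbol{x} - \boldsymbol{\mu})^t \boldsymbol{\lambda} (\boldsymbol{x} - \boldsymbol{\mu})$ has an $\mathrm{Fs}(y \mid k, \alpha)$ density.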


The Multivariate Normal-Gamma Distribution 

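A continuous random vector $\boldsymbol{x} = (x_1, \ldots, x_k)$ and a random quantity $y$ have
a joint multivariate normal-gamma distribution of dimension $k$, with parameters
$\boldsymbol{\mu}$, $\boldsymbol{\lambda}$, $\alpha$, $\beta$ ($\boldsymbol{\mu} \in \Re^k$, $\boldsymbol{\lambda}$ a $k \times k$ symmetric, positive-definite matrix, $\alpha > 0$ and
$\beta > 0$) if the joint probability density of $\boldsymbol{x}$ and $y$, $\mathrm{Ng}_k(\boldsymbol{x}, y \mid \boldsymbol{\mu}, \boldsymbol{\lambda}, \alpha, \beta)$, is

$$\mathrm{Ng}_k(\boldsymbol{x}, y \mid \boldsymbol{\mu}, \boldsymbol{\lambda}, \alpha, \beta) = \mathrm{N}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\lambda} y)\, \mathrm{Ga}(y \mid \alpha, \beta),$$

where the multivariate normal and gamma densities have already been defined.
From the definition, the conditional density of $\boldsymbol{x}$ given $y$ is $\mathrm{N}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\lambda} y)$ and
the marginal density of $y$ is $\mathrm{Ga}(y \mid \alpha, \beta)$. Moreover, the marginal density of $\boldsymbol{x}$ is
$\mathrm{St}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \alpha\beta^{-1}\boldsymbol{\lambda}, 2\alpha)$.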


The Multivariate Normal-Wishart Distribution 


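A continuous random vector $\boldsymbol{x}$ and a symmetric, positive-definite matrix of random
quantities $\boldsymbol{y}$ have a joint normal-Wishart distribution of dimension $k$, with parameters
$\boldsymbol{\mu}$, $\lambda$, $\alpha$, $\boldsymbol{\beta}$ ($\boldsymbol{\mu} \in \Re^k$, $\lambda > 0$, integer $2\alpha > k - 1$, and $\boldsymbol{\beta}$ a $k \times k$ symmetric,
non-singular matrix), if the probability density of $\boldsymbol{x}$ and the $k(k + 1)/2$ distinct
elements of $\boldsymbol{y}$, $\mathrm{Nw}_k(\boldsymbol{x}, \boldsymbol{y} \mid \boldsymbol{\mu}, \lambda, \alpha, \boldsymbol{\beta})$, is

$$\mathrm{Nw}_k(\boldsymbol{x}, \boldsymbol{y} \mid \boldsymbol{\mu}, \lambda, \alpha, \boldsymbol{\beta}) = \mathrm{N}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \lambda\boldsymbol{y})\, \mathrm{Wi}_k(\boldsymbol{y} \mid \alpha, \boldsymbol{\beta}),$$

where the multivariate normal and Wishart densities are as defined above.

From the definition, the conditional density of $\boldsymbol{x}$ given $\boldsymbol{y}$ is $\mathrm{N}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \lambda\boldsymbol{y})$ and
the marginal density of $\boldsymbol{y}$ is $\mathrm{Wi}_k(\boldsymbol{y} \mid \alpha, \boldsymbol{\beta})$. Moreover, the marginal density of $\boldsymbol{x}$ is
$\mathrm{St}_k(\boldsymbol{x} \mid \boldsymbol{\mu}, \lambda\alpha\boldsymbol{\beta}^{-1}, 2\alpha)$.


The Bilateral Pareto Distribution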


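A continuous bivariate random vector $(x, y)$ has a bilateral Pareto distribution with
parameters $\beta_0$, $\beta_1$, and $\alpha$ ($(\beta_0, \beta_1) \in \Re^2$, $\beta_0 < \beta_1$, $\alpha > 0$) if its density function
$\mathrm{Pa}_2(x, y \mid \alpha, \beta_0, \beta_1)$ is

$$\mathrm{Pa}_2(x, y \mid \alpha, \beta_0, \beta_1) = c\, (y - x)^{-(\alpha + 2)}, \quad x \le \beta_0,\ y \ge \beta_1,$$

where $c = \alpha(\alpha + 1)(\beta_1 - \beta_0)^{\alpha}$. The mean and variance are given by

$$E[x] = \frac{\alpha\beta_0 - \beta_1}{\alpha - 1}, \qquad E[y] = \frac{\alpha\beta_1 - \beta_0}{\alpha - 1}, \quad \text{if } \alpha > 1,$$

$$V[x] = V[y] = \frac{\alpha(\beta_1 - \beta_0)^2}{(\alpha - 1)^2(\alpha - 2)}, \quad \text{if } \alpha > 2,$$

and the correlation between $x$ and $y$ is $-\alpha^{-1}$. The marginal distributions of
$t_1 = \beta_1 - x$ and $t_2 = y - \beta_0$ are both $\mathrm{Pa}(t \mid \beta_1 - \beta_0, \alpha)$.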


3.3 GENERALISED OPTIONS AND UTILITIES 


3.3.1. Motivation and Preliminaries 


For reasons of mathematical or descriptive convenience, it is common in statistical 
decision problems to consider sets of options which consist of part or all of the real 
line (as in problems of point estimation) or are part of some more general space. It 
is therefore desirable to extend the concepts and results of Chapter 2 to a much more 
general mathematical setting, going beyond finite, or even countable, frameworks, 
first by taking $\mathcal{E}$ to be a $\sigma$-algebra and then suitably extending the fundamental
notion of an option. 

In the finite case, an option was denoted by $a = \{c_j \mid E_j,\ j \in J\}$, with the
straightforward interpretation that, if option $a$ is chosen, $c_j$ is the consequence
of the occurrence of the event $E_j$. The extension of this function definition to
infinite settings clearly requires some form of constructive limit process, analogous 
to that used in Lebesgue measure and integration theory in passing from simple 
(i.e., “step”) functions to more general functions. Since the development given in 
Chapter 2 led to the assessment of options in terms of their expected utilities, the 
“natural” definition of limit that suggests itself is one based fundamentally on the 
expected utility idea (Bernardo, Ferrandiz and Smith, 1985). 

Let us therefore consider a decision problem $\{\mathcal{A}, \mathcal{E}, \mathcal{C}, \le\}$, which is described
by a probability space $\{\Omega, \mathcal{F}, P\}$ and utility function $u : \mathcal{C} \to \Re$, and let

$$\mathcal{D} = \left\{d : \Omega \to \mathcal{C};\ \bar{u}(d) = \int_{\Omega} u(d(\omega))\, dP(\omega) < \infty\right\}.$$

In other words, $\mathcal{D}$ consists of those functions (soon to be called decisions) $d : \Omega \to \mathcal{C}$
for which $u \circ d = u(d(\cdot))$ is a random quantity whose expectation exists. In the
case of the particular subset $\mathcal{A}$ of $\mathcal{D}$, $\bar{u}(a) = \bar{u}(a \mid \Omega)$ is precisely the expected
utility of the simple option $a$ (see Definition 2.16 with $G = \Omega$; we shall return
later to the case of a general conditioning event $G$ and corresponding probability
measure $P(\cdot \mid G)$). In all the definitions and propositions in this section, $\mathcal{A}$ and
$\{\Omega, \mathcal{F}, P\}$ are to be understood as fixed background specifications.


Definition 3.11. (Convergence in expected utility). For a given utility function,
$u : \mathcal{C} \to \Re$, a sequence of functions $d_1, d_2, \ldots$ in $\mathcal{D}$ is said to $u$-converge
to a function $d$ in $\mathcal{D}$, written $d_i \to_u d$, if and only if

(i) $u \circ d_i$ converges to $u \circ d$ almost surely (with respect to $P$),

(ii) $\bar{u}(d_i) \to \bar{u}(d)$.


Definition 3.12. (Decisions). For a given utility function, $u : \mathcal{C} \to \Re$, a function
$d \in \mathcal{D}$ is a decision (generalised option) if and only if there exists a
sequence $a_1, a_2, \ldots$ of simple options such that $a_i \to_u d$; the value of

$$\bar{u}(d) = \lim_{i \to \infty} \bar{u}(a_i)$$

is then called the expected utility of the decision $d$.


Discussion of Definitions 3.11 and 3.12. In abstract mathematical terms,
the extension from simple functions, mapping $\mathcal{C}$ to $\Re$, to more general functions
requires some form of limit process. However, the fundamental coherence result
of Proposition 2.25 was that simple options should be compared in terms of their
expected utilities. In order for this to carry over smoothly to decisions (generalised
options), it is natural to require a constructive definition of the latter in terms of a
limit concept directly expressed in terms of expected utilities.

As it stands, however, this constructive definition does not provide a straightforward
means of checking whether or not, given a specified utility function, $u$, a
function $d \in \mathcal{D}$ is or is not a generalised option. However, we can prove that any
$d \in \mathcal{D}$ such that $u \circ d$ is essentially bounded (i.e., $u \circ d$ is bounded except on a
subset of $\Omega$ of $P$-measure zero) is a decision. More specifically, we can prove the
following.


Proposition 3.6. Given a utility function $u : \mathcal{C} \to \Re$, for any function $d \in \mathcal{D}$
such that $u \circ d$ is essentially bounded, there exist sequences $a_1, a_2, \ldots$ and
$a_1', a_2', \ldots$ of simple options such that $a_i \to_u d$, $a_i' \to_u d$ and, for all $i$,
$\bar{u}(a_i) \le \bar{u}(d) \le \bar{u}(a_i')$.

Proof. We prove first that if $u \circ d$ is essentially bounded above then there
exists a sequence of simple options $a_1, a_2, \ldots$ such that $a_i \to_u d$ and $\bar{u}(a_i) \ge \bar{u}(d)$,
for all $i$. (An exactly parallel proof exists if "above" is replaced by "below" and $\ge$
by $\le$.) We begin by defining the partitions $\{E_{ij},\ j \in J_i\}$, $i = 1, 2, \ldots$, where

$$E_{ij} = \begin{cases} \{\omega \in \Omega;\ u \circ d(\omega) < -i\} & \text{if } j = -i2^i - 1, \\ \{\omega \in \Omega;\ u \circ d(\omega) \in [\,j2^{-i}, (j + 1)2^{-i})\} & \text{if } j = -i2^i, \ldots, i2^i - 1, \\ \{\omega \in \Omega;\ u \circ d(\omega) \ge i\} & \text{if } j = i2^i. \end{cases}$$

For each $i$, this establishes a partition of $\Omega$ into $2(i2^i + 1)$ events, in such a
way that two extreme events contain outcomes with values of $u \circ d(\omega) < -i$ or
$\ge i$, whereas the other events contain outcomes whose values of $u \circ d(\omega)$ do not
differ by more than $2^{-i}$.

We now define a sequence $\{a_i\}$ of simple options $a_i = \{c_{ij} \mid E_{ij},\ j \in J_i\}$
such that:

(i) if $P(E_{ij}) = 0$ then $c_{ij}$ is an arbitrary element of $\mathcal{C}$;

(ii) if $P(E_{ij}) > 0$ then $c_{ij} \in d(E_{ij})$ and

$$u(c_{ij})\, P(E_{ij}) \ge \int_{E_{ij}} u(d(\omega))\, dP(\omega).$$

To see that the $c_{ij}$ exist and are well defined, note that, since $\bar{u}(d) < \infty$, there
exists $\bar{u}_{ij}(d) < \infty$, defined by

$$\bar{u}_{ij}(d) = \frac{1}{P(E_{ij})} \int_{E_{ij}} u(d(\omega))\, dP(\omega);$$

but, if $u(d(\omega)) < \bar{u}_{ij}(d)$ for all $\omega \in E_{ij}$, then we would have

$$\int_{E_{ij}} u(d(\omega))\, dP(\omega) < \bar{u}_{ij}(d)\, P(E_{ij}),$$

thus contradicting the definition of $\bar{u}_{ij}(d)$.

By construction, $a_i \to d$ almost surely. Hence, for all $\varepsilon > 0$, there exists $i$
such that $u \circ d(\omega) \in [-i, i)$, with $2^{-i} < \varepsilon$ and, for this $i$, $|a_i(\omega) - d(\omega)| < \varepsilon$. In
addition, for all $i$,

$$\bar{u}(a_i) = \sum_{j \in J_i} u(c_{ij})\,P(E_{ij}) \ge \int_\Omega u(d(\omega))\,dP(\omega) = \bar{u}(d).$$

To show that $a_i \to_u d$, it remains to prove that

$$\int_\Omega |u(d(\omega)) - u(a_i(\omega))|\,dP(\omega) \to 0, \quad \text{as } i \to \infty.$$

Writing $\int_\Omega = \int_{A_i} + \int_{A_i^c}$, where $A_i = \{\omega \in \Omega;\ u(d(\omega)) < -i\}$, we note that for
sufficiently large $i$ (larger than the essential supremum of $u \circ d$),

$$\int_{A_i^c} |u(d(\omega)) - u(a_i(\omega))|\,dP(\omega) \le 2^{-i} P(A_i^c),$$

which converges to zero as $i \to \infty$. Moreover, since $\bar{u}(a_i) \ge \bar{u}(d)$, for all $i$, and
$\bar{u}(d) < \infty$,

$$\int_{A_i} |u(d(\omega)) - u(a_i(\omega))|\,dP(\omega) \le -2 \int_{A_i} u(d(\omega))\,dP(\omega)$$

and, since $A_i \to \emptyset$ as $i \to \infty$, this also converges to zero.
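
The construction used in this proof can be illustrated numerically. The following is a minimal sketch, assuming a finite $\Omega$ with given probabilities and a decision summarised only by its utilities $u(d(\omega))$; the names `dyadic_cell` and `upper_simple_option` and the numbers are illustrative and do not come from the text. Within each dyadic event the sketch picks the largest attained utility, which is one admissible choice of $c_{ij}$ because it is at least the conditional mean.

```python
import math

def dyadic_cell(value, i):
    """Index j of the level-i event E_ij containing an outcome with utility `value`."""
    if value < -i:
        return -i * 2**i - 1          # lower extreme event
    if value >= i:
        return i * 2**i               # upper extreme event
    return math.floor(value * 2**i)   # value lies in [j 2^-i, (j+1) 2^-i)

def upper_simple_option(utilities, probs, i):
    """Utilities u(a_i(w)) of a simple option that is constant on each E_ij."""
    attained = {}
    for u, p in zip(utilities, probs):
        if p > 0:
            attained.setdefault(dyadic_cell(u, i), []).append(u)
    # The largest attained utility in a cell is >= the cell's conditional mean,
    # so it satisfies condition (ii) of the construction.
    chosen = {j: max(us) for j, us in attained.items()}
    return [chosen.get(dyadic_cell(u, i), u) for u in utilities]

def expected(values, probs):
    return sum(v * p for v, p in zip(values, probs))

# Hypothetical utilities u(d(w)) and probabilities P({w}) on a six-point space.
utilities = [-0.9, -0.3, 0.1, 0.15, 0.7, 0.72]
probs = [0.1, 0.2, 0.3, 0.1, 0.2, 0.1]

for i in range(1, 7):
    a_i = upper_simple_option(utilities, probs, i)
    gap = expected(a_i, probs) - expected(utilities, probs)
    print(i, round(gap, 6))   # non-negative and at most 2**-i
```

Choosing the smallest attained utility in each cell instead would give the parallel sequence that bounds the expected utility from below.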


In fact, we can show that any decision, whether or not $u \circ d$ is essentially
bounded, can be obtained as the limit of “bounding sequences of simple options”
in the sense made precise in the following.


Proposition 3.7. Given a utility function $u : \mathcal{C} \to \Re$, for any decision $d \in \mathcal{D}$,
there exist sequences $a_1, a_2, \ldots$ and $a'_1, a'_2, \ldots$ of simple options such that
$a_i \to_u d$, $a'_i \to_u d$ and, for all $i$, $\bar{u}(a_i) \le \bar{u}(d) \le \bar{u}(a'_i)$.


Proof. We shall show that there exists a sequence of simple options $a_1, a_2, \ldots$
such that $a_i \to_u d$ and, for all $i$, $\bar{u}(a_i) \le \bar{u}(d)$. An obviously parallel proof exists
for the other inequality.

We first note that either $u \circ d$ is essentially bounded above or it is not. In
the former case, let $K$ denote the essential supremum and define $A_0 = \{\omega \in \Omega;\ u \circ d(\omega) = K\}$; in the latter case, define $K = \infty$ and $A_0 = \emptyset$.

If $P(A_0) > 0$, choose a decreasing sequence of real numbers $\alpha_j \in [0, 1]$ such
that $\alpha_j \to 0$. Then by Axiom 4 and Proposition 2.6 there exists a sequence of
standard events $S_1, S_2, \ldots$ such that $S_{j+1} \subset S_j$, $\mu(S_j) = \alpha_j$ and $P(A_0 \cap S_j) = P(A_0)\,\alpha_j$, for all $j$; then, define $A_j = A_0 \cap S_j$ and choose a consequence $c \in \mathcal{C}$
such that $u(c) < K$.

If $P(A_0) = 0$, choose a consequence $c \in \mathcal{C}$ and an increasing sequence of real
numbers $\beta_j$, such that $\beta_j \to K$ and $u(c) < \beta_1$; let $A_j = \{\omega \in \Omega;\ u \circ d(\omega) > \beta_j\}$.
In either case, define

$$d_j(\omega) = \begin{cases} d(\omega) & \text{if } \omega \notin A_j, \\ c & \text{if } \omega \in A_j. \end{cases}$$

Since $d$ is a decision, there exists a sequence $a_1, a_2, \ldots$ of simple options such that
$a_i \to_u d$. If, for each $a_i$, we now define a new sequence of simple options $a_{ij}$ by

$$a_{ij}(\omega) = \begin{cases} a_i(\omega) & \text{if } \omega \notin A_j, \\ c & \text{if } \omega \in A_j, \end{cases}$$

we clearly have $a_{ij} \to_u d_j$, so that, for all $j$, $d_j$ is a decision. Moreover, by
construction $d_j \to_u d$, $\bar{u}(d_j) \le \bar{u}(d)$ and $u \circ d_j$ is bounded above, for all $j$. Hence,
by Proposition 3.6, there exist sequences of simple options

$$a^{(j)}_1, a^{(j)}_2, \ldots, \text{ such that } a^{(j)}_i \to_u d_j \text{ and } \bar{u}(a^{(j)}_i) \ge \bar{u}(d_j).$$


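If we now choose a subsequence

$$a^{(1)}_{i_1}, a^{(2)}_{i_2}, \ldots, \text{ such that } \bar{u}(a^{(j)}_{i_j}) - \bar{u}(d_j) < \bar{u}(d) - \bar{u}(d_j),$$

the required result follows, since for all $k$,

$$\bar{u}(a^{(k)}_{i_k}) < \bar{u}(d),$$

and $d_j \to_u d$ implies that $a^{(k)}_{i_k} \to_u d$.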


3.3.2 Generalised Preferences 


Given the adoption, for mathematical or descriptive convenience, of the extended 
framework developed in the previous section, it is natural to require that prefer- 
ences among simple options should “carry over”, under the limit process we have 
introduced, to the corresponding decisions. This is made precise in the following. 


Postulate 2. (Extension of the preference relation). Given a utility function
$u : \mathcal{C} \to \Re$, for any decisions $d_1, d_2$, and sequences of simple options $\{a_i\}$ and
$\{a'_i\}$ such that $a_i \to_u d_1$ and $a'_i \to_u d_2$, we have:

(i) if, for all $i \ge i_0$, for some $i_0$, $a_i \ge a'_i$, then $d_1 \ge d_2$;

(ii) if, for all $i \ge i_0$, for some $i_0$, $a_i > a'_i$, then $d_1 > d_2$.


The first part of the postulate simply captures the notion of the carry-over of 
preferences in the limit; the second part of the postulate is an obvious necessary 
condition for strict preference. 

Together with our previous axioms, this postulate enables us to establish a very 
general statement of the identification of quantitative coherence with the principle 
of maximising expected utility. 
Proposition 3.8. (Maximisation of expected utility for decisions).
Given any two decisions $d_1, d_2$,

$$d_1 \ge d_2 \iff \bar{u}(d_1) \ge \bar{u}(d_2).$$

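Proof. We first establish that $\bar{u}(d_2) > \bar{u}(d_1)$ implies that $d_2 > d_1$. By Proposition 3.7,
there exist sequences of simple options $a_1, a_2, \ldots$ and $a'_1, a'_2, \ldots$ such
that $a_i \to_u d_1$, $a'_i \to_u d_2$ and, for all $i$, $\bar{u}(a_i) \ge \bar{u}(d_1)$ and $\bar{u}(a'_i) \le \bar{u}(d_2)$. With
$\varepsilon = [\bar{u}(d_2) - \bar{u}(d_1)]/3$, we can choose $i_1, i_2$ such that, for all $j \ge \max\{i_1, i_2\}$,

$$\bar{u}(a_j) - \bar{u}(d_1) < \varepsilon, \qquad \bar{u}(d_2) - \bar{u}(a'_j) < \varepsilon,$$

and we can choose $i, i'$ such that, for all $j \ge i$,

$$\bar{u}(a_j) - \bar{u}(d_1) \le \bar{u}(a_{i_1}) - \bar{u}(d_1)$$

and, for all $j \ge i'$,

$$\bar{u}(d_2) - \bar{u}(a'_j) \le \bar{u}(d_2) - \bar{u}(a'_{i_2}).$$

It follows from Proposition 2.25 that, for all $j \ge \max\{i, i'\}$,

$$a'_j \ge a'_{i_2} > a_{i_1} \ge a_j,$$

and so, by Postulate 2, $d_2 > d_1$.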
To complete the proof ($d_1 \sim d_2 \Rightarrow \bar{u}(d_1) = \bar{u}(d_2)$ being obvious), we must
show that $\bar{u}(d_1) = \bar{u}(d_2)$ implies that $d_1 \sim d_2$. By Proposition 3.7, there exist
sequences of simple options $\{a^{(k)}_i\}$, $k = 1, 2, 3, 4$, such that $a^{(k)}_i \to_u d_1$ for $k = 1, 2$,
$a^{(k)}_i \to_u d_2$ for $k = 3, 4$, and, for all $i$,

$$\bar{u}(a^{(1)}_i) \le \bar{u}(d_1) \le \bar{u}(a^{(2)}_i), \qquad \bar{u}(a^{(3)}_i) \le \bar{u}(d_2) \le \bar{u}(a^{(4)}_i).$$

Since we have $\bar{u}(d_1) = \bar{u}(d_2)$, this implies, by Proposition 2.25, that $a^{(2)}_i \ge a^{(3)}_i$
and $a^{(4)}_i \ge a^{(1)}_i$, for all $i$, and hence, by Postulate 2, $d_1 \ge d_2$ and $d_2 \ge d_1$, so that
$d_1 \sim d_2$.

Proposition 3.9. For any $G > \emptyset$,

$$d_1 \ge_G d_2 \iff \bar{u}(d_1\,|\,G) \ge \bar{u}(d_2\,|\,G).$$

Proof. Throughout the above, the probability measure $P(\cdot)$ can be replaced
by $P(\cdot\,|\,G)$ without any basic modifications to the proofs and results. Writing

$$\bar{u}(d\,|\,G) = \int_\Omega u \circ d(\omega)\,dP(\omega\,|\,G),$$

it is easily verified that $P(G)\,\bar{u}(d\,|\,G) = \int_\Omega 1_G(\omega)\,u \circ d(\omega)\,dP(\omega)$, where $1_G$ is the indicator function
of $G$. It follows that if $u \circ d$ is integrable with respect to $P(\cdot)$ then it is integrable
with respect to $P(\cdot\,|\,G)$.
This establishes in full generality that the decision criterion of maximising
the expected utility is the only criterion which is compatible with an intuitive set
of quantitative coherence axioms and the natural mathematical extensions encapsulated
in Postulates 1 and 2. Specifically, we have shown that, given a general
decision problem, where $\omega \in \Omega$ labels the uncertain outcomes associated with
the problem, $u(d(\omega))$ describes the current preferences among consequences and
$p(\omega)$, the probability density of $P$ with respect to the appropriate dominating
measure, describes current beliefs about $\omega$, the optimal action is that $d^*$ which
maximises the expected utility,

$$\bar{u}(d) = \int_\Omega u(d(\omega))\,p(\omega)\,d\omega.$$
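
As an illustration of this criterion, the following minimal sketch treats the discrete analogue, in which the integral becomes a sum over a finite set of states. The decisions, utilities and probabilities are hypothetical values chosen only for the example, as are the function names.

```python
def expected_utility(utility_row, probs):
    """Discrete analogue of the integral of u(d(w)) p(w) over Omega."""
    return sum(u * p for u, p in zip(utility_row, probs))

def optimal_decision(utilities, probs):
    """Return the decision with the largest expected utility, plus all the scores."""
    scores = {d: expected_utility(row, probs) for d, row in utilities.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical problem: three decisions, two states of the world.
probs = [0.3, 0.7]
utilities = {"d1": [10.0, 0.0], "d2": [4.0, 5.0], "d3": [0.0, 8.0]}

best, scores = optimal_decision(utilities, probs)
print(best, scores)
```

With these numbers the expected utilities are approximately 3.0, 4.7 and 5.6, so `d3` is selected.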


As we saw in Section 2.6.3, it is natural before making a decision to consider 
trying to reduce the current uncertainty by obtaining further information by ex- 
perimentation. Whether or not this is sensible obviously depends on the relative 
costs and benefits of such additional information, and we shall now extend the 
notions related to the value of information, introduced in Section 2.6.3, into the 
more general mathematical framework established in this chapter. 


3.3.3 The Value of Information 


For the general decision problem, the decision tree for experimental design, given
originally in Figure 2.6, now takes the form given in Figure 3.2, where, as in Section
2.6.3, the utility notation is extended to make explicit the possible dependence on
the experiment performed $e$ (or $e_0$ if no data are collected) and the data obtained,
$x$. If $d^*_0$ is the optimal decision corresponding to $e_0$, the expected utility from an
optimal decision with no additional information is defined by

$$\bar{u}(e_0) = \bar{u}(d^*_0, e_0) = \sup_d \int_\Omega u(d, e_0, \omega)\,p(\omega\,|\,e_0, d)\,d\omega.$$

Let $d^*_x$ be the optimal decision after experiment $e$ has been performed and data
$x$ have been obtained, so that $\bar{u}(d^*_x, e, x)$, the expected utility from the optimal
decision given $e$ and $x$, is

$$\bar{u}(d^*_x, e, x) = \sup_d \int_\Omega u(d, e, x, \omega)\,p(\omega\,|\,e, x, d)\,d\omega,$$

and, hence, the expected utility from the optimal decision following $e$ is

$$\bar{u}(e) = \int_X \bar{u}(d^*_x, e, x)\,p(x\,|\,e)\,dx,$$
where $p(x\,|\,e)$ describes beliefs about the occurrence of $x$, were $e$ to be performed.


Figure 3.2 Generalised decision tree for experimental design


Proposition 3.10. (Optimal experimental design). The optimal decision is
to perform experiment $e^*$ if $\bar{u}(e^*) = \max_e \bar{u}(e)$ and $\bar{u}(e^*) > \bar{u}(e_0)$, and to
perform no experiment otherwise.


Proof. This follows immediately.


The expected value of the information provided by additional data $x$ may
be computed as the (posterior) expected difference between the utilities which
correspond to optimal decisions after and before the data. Thus,


Definition 3.13. (The value of additional information).
(i) The expected value of the information provided by $x$ is

$$v(e, x) = \int_\Omega \{u(d^*_x, e, x, \omega) - u(d^*_0, e_0, \omega)\}\,p(\omega\,|\,e, x, d^*_x)\,d\omega;$$

(ii) the expected value of the experiment $e$ is

$$v(e) = \int_X v(e, x)\,p(x\,|\,e)\,dx.$$


Let us now consider the optimal decisions which would be available to us if
we knew the value of $\omega$. Thus, let $d^*_\omega$ be the optimal decision given $\omega$; i.e., such
that, for all $d$,

$$u(d^*_\omega, e_0, \omega) \ge u(d, e_0, \omega), \qquad \omega \in \Omega.$$
Then, given $\omega$, the loss suffered by choosing another decision $d \ne d^*_\omega$ would be

$$u(d^*_\omega, e_0, \omega) - u(d, e_0, \omega).$$

For $d = d^*_0$, the optimal decision with no additional data, this utility difference
measures (conditional on $\omega$) the value of perfect information. Its expected value
with respect to $p(\omega)$ will define, under certain conditions, an upper bound on
the increase in utility which additional information about $\omega$ could be expected to
provide.


Definition 3.14. (Expected value of perfect information). The opportunity
loss of choosing $d$ is defined to be

$$l(d, \omega) = u(d^*_\omega, e_0, \omega) - u(d, e_0, \omega),$$

and the expected value of perfect information about $\omega$ is defined by

$$v^*(e_0) = \int_\Omega l(d^*_0, \omega)\,p(\omega\,|\,e_0, d^*_0)\,d\omega,$$

where $d^*_0$ is the optimal decision with no additional information.
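
A small discrete sketch may help fix ideas. It assumes a finite set of states, beliefs that do not depend on the decision, and a hypothetical utility table; the function name `evpi` is introduced here for illustration only.

```python
def expected(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def evpi(utilities, probs):
    """Expected opportunity loss of the prior-optimal decision d0*."""
    d0 = max(utilities, key=lambda d: expected(utilities[d], probs))
    n_states = len(probs)
    # u(d_w*, e0, w): the best utility attainable if the state w were known.
    best_per_state = [max(row[w] for row in utilities.values()) for w in range(n_states)]
    loss = [best_per_state[w] - utilities[d0][w] for w in range(n_states)]
    return d0, loss, expected(loss, probs)

# Hypothetical problem: three decisions, two states of the world.
probs = [0.3, 0.7]
utilities = {"d1": [10.0, 0.0], "d2": [4.0, 5.0], "d3": [0.0, 8.0]}

d0, loss, value = evpi(utilities, probs)
print(d0, loss, value)
```

Here the prior-optimal decision is `d3`, its opportunity loss is 10 in the first state and 0 in the second, and the expected value of perfect information is 0.3 × 10 = 3.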


As we remarked in Section 2.6.3, in many situations the utility function may
be thought of as made up of two separate components: the experimental cost
of performing $e$ and obtaining $x$, and the utility of directly taking decision $d$ and
finding $\omega$ to be the state of the world. Given such an (additive) decomposition, we
can establish a useful upper bound for the expected value of an experiment.


Proposition 3.11. (Additive decomposition). If the utility function has the
form

$$u(d, e, x, \omega) = u(d, e_0, \omega) - c(e, x),$$

with $c(e, x) \ge 0$, and the probability distributions are such that

$$p(\omega\,|\,e, x, d) = p(\omega\,|\,e, x), \qquad p(\omega\,|\,e_0, d) = p(\omega\,|\,e_0),$$

then, for any available experiment $e$,

$$v(e) \le v^*(e_0) - \bar{c}(e),$$

where $\bar{c}(e) = \int_X c(e, x)\,p(x\,|\,e)\,dx$ is the expected cost of $e$.


Proof. This closely parallels the proof, given in Proposition 2.27, for the finite
case.
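
The bound can be checked numerically in a finite setting. The sketch below assumes the additive form with a constant cost, a hypothetical two-state problem and a hypothetical likelihood table `likelihood[w][x]` standing for $p(x\,|\,\omega, e)$; none of these numbers or names come from the text.

```python
def expected(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def best_expected_utility(utilities, probs):
    return max(expected(row, probs) for row in utilities.values())

def value_of_experiment(utilities, prior, likelihood, cost):
    """v(e) when u(d, e, x, w) = u(d, e0, w) - cost and beliefs do not depend on d."""
    total = 0.0
    for x in range(len(likelihood[0])):
        joint = [prior[w] * likelihood[w][x] for w in range(len(prior))]
        p_x = sum(joint)
        if p_x > 0:
            posterior = [j / p_x for j in joint]
            total += p_x * best_expected_utility(utilities, posterior)
    return total - cost - best_expected_utility(utilities, prior)

def perfect_information_value(utilities, prior):
    """v*(e0): expected opportunity loss of the prior-optimal decision."""
    best_per_state = [max(row[w] for row in utilities.values())
                      for w in range(len(prior))]
    return expected(best_per_state, prior) - best_expected_utility(utilities, prior)

prior = [0.3, 0.7]
utilities = {"d1": [10.0, 0.0], "d2": [4.0, 5.0], "d3": [0.0, 8.0]}
likelihood = [[0.8, 0.2], [0.1, 0.9]]   # rows: states w, columns: data x
cost = 0.5                              # constant c(e, x), so the expected cost is 0.5

v_e = value_of_experiment(utilities, prior, likelihood, cost)
v_star = perfect_information_value(utilities, prior)
print(v_e, v_star - cost)
```

For these values $v(e)$ is about 1.34, below the bound $v^*(e_0) - \bar{c}(e) = 3 - 0.5 = 2.5$.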


This concludes the mathematical extension of the basic framework and asso- 
ciated axioms. In the next section, we reconsider the important special problem 
of statistical inference, previously discussed in detail in its finitistic setting in Sec- 
tion 2.7. 
3.4 GENERALISED INFORMATION MEASURES
3.4.1 The General Problem of Reporting Beliefs


In Section 2.7, we argued that the problem of reporting a degree of belief distribution
for a (finite) class of exclusive and exhaustive “hypotheses” $\{H_j,\ j \in J\}$,
conditional on some relevant data $D$ and initial state of information $M_0$, could be
formulated as a decision problem $\{\mathcal{E}, \mathcal{C}, \mathcal{A}, \le\}$. Here, $\{E_j,\ j \in J\}$ is a partition
of $\Omega$, consisting of elements of $\mathcal{E}$, with the interpretation $E_j$ = “hypothesis $H_j$ is
true”, $\mathcal{A}$ relates to

$$\mathcal{Q} = \Big\{ q = (q_j,\ j \in J);\ q_j \ge 0,\ \sum_{j \in J} q_j = 1 \Big\},$$

where $q_j$ is the probability which, conditional on $D$, an individual reports as the
probability of $E_j$ being true, and the set of consequences $\mathcal{C}$ consists of all pairs
$(q, E_j)$, representing the possible conjunction of reported beliefs and true hypotheses.
In the previous finitistic setting, we denoted by

$$p = \{p_j = P(E_j\,|\,D),\ j \in J\}, \qquad p_j > 0, \quad \sum_{j \in J} p_j = 1,$$

the probability measure describing an individual’s actual beliefs, conditional on $D$.
We then proceeded to consider a special class of utility functions (score functions)
appropriate to this reporting problem and to examine the resulting forms of implied
decisions and the links with information theory. In this section, we shall generalise
these concepts and results to the extended framework developed in the previous
sections.

The first generalisation consists in noting that the set of alternative “hypotheses”
now corresponds to the set of possible values of a (possibly continuous) random
vector, $\omega$, say, labelling the “unknown states of the world”, so that the relevant
uncertain events are $E_\omega = \{\omega\}$, $\omega \in \Omega$, with the interpretation $E_\omega$ = “the hypothesis
$\omega$ is true”. Quantitative coherence requires that any particular individual’s
uncertainty about $\omega$, given data $D$ and initial state of information $M_0$, should be
represented by a probability distribution $P$ over a $\sigma$-algebra of subsets of $\Omega$, which
we shall assume can be described by a density (to be understood as a mass function
in the discrete case)

$$p_\omega(\cdot\,|\,D) = \Big\{ p(\omega\,|\,D),\ \omega \in \Omega;\ p(\omega\,|\,D) > 0,\ \int_\Omega p(\omega\,|\,D)\,d\omega = 1 \Big\}.$$

We shall take the set of possible inference statements to be the set of probability
distributions for $\omega$, compatible with $D$. We denote by $\mathcal{D}$ the set of functions $d_p$,
one for each $p_\omega(\cdot\,|\,D)$, which map $\omega$ to the pair $(p_\omega(\cdot\,|\,D), \omega)$.
3.4.2 The Utility of a General Probability Distribution


In this general setting, the problem of providing an inference statement about a class
of exclusive and exhaustive “hypotheses” $\{\omega,\ \omega \in \Omega\}$, conditional on data $D$, is
a decision problem, which we can conveniently denote by $\{\mathcal{D}, \Omega, u, P\}$, where $\Omega$
is the set of possible values of the random quantity $\omega$, $\mathcal{D}$ relates to the class of
probability densities for $\omega \in \Omega$ compatible with $D$,

$$\mathcal{Q} = \Big\{ q_\omega(\cdot\,|\,D);\ q(\omega\,|\,D) > 0,\ \omega \in \Omega \text{ and } \int_\Omega q(\omega\,|\,D)\,d\omega = 1 \Big\},$$

where $q_\omega(\cdot\,|\,D)$ is the density which an individual reports as the basis for describing
beliefs about $\omega$ conditional on $D$. The set of consequences $\mathcal{C}$ consists of all pairs
$(q_\omega(\cdot\,|\,D), \omega)$, representing the conjunction of reported beliefs and true “states of nature”.
Throughout this section, we shall denote an individual’s actual belief density by
$p_\omega(\cdot\,|\,D)$. The decision space $\mathcal{D}$ consists of $d_q$’s corresponding to choosing to report
$q_\omega(\cdot\,|\,D)$’s and defined by $d_q(\omega) = (q_\omega(\cdot\,|\,D), \omega)$. We shall assume the individual
to be coherent, so that $d_p \in \mathcal{D}$. Without loss of generality, we shall assume that
$p_\omega(\cdot\,|\,D)$ and the $q_\omega(\cdot\,|\,D) \in \mathcal{Q}$ are strictly positive probability densities, so that,
for all $\omega \in \Omega$, $p(\omega\,|\,D) > 0$ and $q(\omega\,|\,D) > 0$ for all $d_q \in \mathcal{D}$.

We complete the specification of this decision problem by inducing the preference
ordering through direct specification of a utility function $u$, which describes
the “value” $u(q_\omega(\cdot\,|\,D), \omega)$ of reporting the probability density $q_\omega(\cdot\,|\,D)$ were $\omega$
to turn out to be the true “state of nature”. For this purpose and with the same
motivation, we generalise the notion of score function introduced in Definition 2.20.


Definition 3.15. (Score function). A score function for probability densities
$q_\omega(\cdot\,|\,D)$ defined on $\Omega$ is a mapping $u : \mathcal{Q} \times \Omega \to \Re$. A score function is said
to be smooth if it is continuously differentiable as a function of $q(\omega\,|\,D)$ for
each $\omega \in \Omega$.


The solution to the decision problem is then to report the density $q_\omega(\cdot\,|\,D)$
which maximises the expected utility

$$\bar{u}(d_q) = \int_\Omega u(q_\omega(\cdot\,|\,D), \omega)\,p(\omega\,|\,D)\,d\omega.$$

As in our earlier development in Chapter 2, we shall wish to restrict utility
functions for the reporting problem in such a way as to encourage a coherent
individual to be honest, given data $D$, in the sense that his or her expected utility
is maximised if and only if $d_q$ is chosen such that, for each $\omega \in \Omega$, $q(\omega\,|\,D) = p(\omega\,|\,D)$.
The appropriate generalisation of Definition 2.21 is the following.
Definition 3.16. (Proper score function). A score function $u$ is proper if, for
each strictly positive probability density $p_\omega(\cdot\,|\,D)$,

$$\sup_{\mathcal{Q}} \int_\Omega u(q_\omega(\cdot\,|\,D), \omega)\,p(\omega\,|\,D)\,d\omega = \int_\Omega u(p_\omega(\cdot\,|\,D), \omega)\,p(\omega\,|\,D)\,d\omega,$$

where the supremum, taken over the class $\mathcal{Q}$ of all distributions for $\omega$ compatible
with $D$, is attained if and only if $q_\omega(\cdot\,|\,D) = p_\omega(\cdot\,|\,D)$, up to sets of zero
measure.


As in the finite case (see Definition 2.22), the simplest proper score function
in the general case is the quadratic.


Definition 3.17. (Quadratic score function). A quadratic score function for
probability densities $q_\omega(\cdot\,|\,D) \in \mathcal{Q}$ defined on $\Omega$ is a mapping $u : \mathcal{Q} \times \Omega \to \Re$
of the form

$$u(q_\omega(\cdot\,|\,D), \omega) = A \Big\{ 2q(\omega\,|\,D) - \int_\Omega q^2(\tilde\omega\,|\,D)\,d\tilde\omega \Big\} + B(\omega), \qquad A > 0,$$

such that the otherwise arbitrary function, $B(\cdot)$, ensures the existence of $\bar{u}(d_q)$
for all $d_q \in \mathcal{D}$.


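Proposition 3.12. A quadratic score function is proper.


Proof. Given data $D$, we must choose $q_\omega(\cdot\,|\,D) \in \mathcal{Q}$ to maximise

$$\bar{u}(d_q) = \int_\Omega u(q_\omega(\cdot\,|\,D), \omega)\,p(\omega\,|\,D)\,d\omega = \int_\Omega \Big[ A \Big\{ 2q(\omega\,|\,D) - \int_\Omega q^2(\tilde\omega\,|\,D)\,d\tilde\omega \Big\} + B(\omega) \Big]\,p(\omega\,|\,D)\,d\omega,$$

subject to $\int_\Omega q(\omega\,|\,D)\,d\omega = 1$. Rearranging, it is easily seen that this is equivalent
to maximising

$$-\int_\Omega (p(\omega\,|\,D) - q(\omega\,|\,D))^2\,d\omega,$$

from which it follows that we require $q(\omega\,|\,D) = p(\omega\,|\,D)$ for almost all $\omega \in \Omega$.
We note again (cf. Proposition 2.28) that the constraint $\int_\Omega q(\omega\,|\,D)\,d\omega = 1$ has not
been needed in establishing this result for the quadratic scoring rule.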
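
The rearrangement in this proof is easy to check in the discrete case, where densities become mass functions and integrals become sums. The sketch below is only an illustration with hypothetical numbers; it takes $B(\omega) = 0$, which does not affect the comparison, and the function names are not from the text.

```python
def quadratic_score(q, j, A=1.0, B=0.0):
    """Discrete analogue of A{2 q_j - sum_i q_i^2} + B when state j is true."""
    return A * (2.0 * q[j] - sum(qi * qi for qi in q)) + B

def expected_score(q, p, A=1.0):
    """Expected quadratic score of reporting q when actual beliefs are p."""
    return sum(p[j] * quadratic_score(q, j, A) for j in range(len(p)))

p = [0.2, 0.5, 0.3]          # actual beliefs (hypothetical)
q = [0.3, 0.3, 0.4]          # a report that differs from p

shortfall = expected_score(p, p) - expected_score(q, p)
squared_distance = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
print(shortfall, squared_distance)   # both approximately 0.06
```

The shortfall from misreporting equals $A \sum_j (p_j - q_j)^2$, and the same holds for a report such as `[0.5, 0.5, 0.5]` that does not sum to one, in line with the remark that the normalisation constraint is not needed.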
In fact, as we argued in Section 2.7, for the problem of reporting pure inference
statements it is natural to restrict further the class of appropriate utility functions.
The following generalises Definition 2.23.


Definition 3.18. (Local score function). A score function is local if for each
$q_\omega(\cdot\,|\,D) \in \mathcal{Q}$ there exist functions $u_\omega$, $\omega \in \Omega$, defined on $\Re^+$ such that

$$u(q_\omega(\cdot\,|\,D), \omega) = u_\omega(q(\omega\,|\,D)).$$


Note that, as in Definition 2.23, the functional form, $u_\omega(\cdot)$, of the dependence
of the score function on the density value $q(\omega\,|\,D)$ which $d_q$ assigns to $\omega$ is allowed
to vary with the particular $\omega$ in question. Intuitively, this enables us to incorporate
the possibility that “bad predictions”, i.e., values of $q(\omega\,|\,D)$, for some “true states
of nature”, $\omega$, may be judged more harshly than others.

The next result generalises Proposition 2.29 and characterises the form of a 
smooth, proper, local score function. 


Proposition 3.13. (Characterisation of proper local score functions).
If $u : \mathcal{Q} \times \Omega \to \Re$ is a smooth, proper, local score function, then it must be of
the form

$$u(q_\omega(\cdot\,|\,D), \omega) = A \log q(\omega\,|\,D) + B(\omega),$$

where $A > 0$ is an arbitrary constant and $B(\cdot)$ is an arbitrary function of $\omega$,
subject to the existence of $\bar{u}(d_q)$ for all $d_q \in \mathcal{D}$.


Proof. Given data $D$, we need to maximise, with respect to $q_\omega(\cdot\,|\,D)$, the
expected utility

$$\bar{u}(d_q) = \int_\Omega u(q_\omega(\cdot\,|\,D), \omega)\,p(\omega\,|\,D)\,d\omega$$

subject to $\int_\Omega q(\omega\,|\,D)\,d\omega = 1$. Since $u$ is local, this reduces to finding an extremal
of

$$F(q_\omega(\cdot\,|\,D)) = \int_\Omega u_\omega(q(\omega\,|\,D))\,p(\omega\,|\,D)\,d\omega - A \Big[ \int_\Omega q(\omega\,|\,D)\,d\omega - 1 \Big].$$

However, for $q_\omega(\cdot\,|\,D)$ to give a stationary value of $F(q_\omega(\cdot\,|\,D))$ it is necessary
that

$$\frac{\partial}{\partial \alpha} F(q_\omega(\cdot\,|\,D) + \alpha\,\tau(\cdot)) \Big|_{\alpha = 0} = 0,$$

for any function $\tau : \Omega \to \Re$ of sufficiently small norm (see, for example, Jeffreys
and Jeffreys, 1946, Chapter 10). This condition reduces to the differential equation

$$D\,u_\omega(q(\omega\,|\,D))\,p(\omega\,|\,D) - A = 0,$$

where $D\,u_\omega$ denotes the first derivative of $u_\omega$. But, since $u_\omega$ is proper, the maximum
of $F(q_\omega(\cdot\,|\,D))$ must be attained at $q_\omega(\cdot\,|\,D) = p_\omega(\cdot\,|\,D)$, so that a smooth,
proper, local utility function must satisfy the differential equation

$$D\,u_\omega(p(\omega\,|\,D))\,p(\omega\,|\,D) - A = 0,$$

whose solution is given by

$$u_\omega(p(\omega\,|\,D)) = A \log p(\omega\,|\,D) + B(\omega),$$

as stated.

This result prompts us to make the following formal definition.
This result prompts us to make the following formal definition. 


Definition 3.19. (Logarithmic score function). A logarithmic score function
for probability densities $q_\omega(\cdot\,|\,D) \in \mathcal{Q}$ defined on $\Omega$ is a mapping
$u : \mathcal{Q} \times \Omega \to \Re$ of the form

$$u(q_\omega(\cdot\,|\,D), \omega) = A \log q(\omega\,|\,D) + B(\omega),$$

$A > 0$, $B(\cdot)$ arbitrary, subject to the existence of $\bar{u}(d_q)$ for all $d_q \in \mathcal{D}$.


For additional discussion of generalised score functions see Good (1969) and
Buehler (1971).
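
To see why the logarithm is singled out among local scores, it can help to compare it with another local score in a discrete setting. The sketch below is an illustration under assumed values: it uses the linear local score $u_\omega(q) = q$ as a contrast, and the three candidate reports are hypothetical.

```python
import math

def expected_local_score(score, q, p):
    """Expected utility sum_j p_j * score(q_j) of reporting q when beliefs are p."""
    return sum(pj * score(qj) for pj, qj in zip(p, q))

p = [0.2, 0.5, 0.3]                     # actual beliefs (hypothetical)
reports = {
    "honest": p,
    "sharpened": [0.05, 0.9, 0.05],
    "flattened": [1 / 3, 1 / 3, 1 / 3],
}

log_scores = {k: expected_local_score(math.log, q, p) for k, q in reports.items()}
linear_scores = {k: expected_local_score(lambda t: t, q, p) for k, q in reports.items()}

print(max(log_scores, key=log_scores.get))        # honest
print(max(linear_scores, key=linear_scores.get))  # sharpened
```

Under the logarithmic score the honest report has the highest expected utility of the three, whereas the linear local score rewards exaggerating the most probable state, so it is local but not proper.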


3.4.3 Generalised Approximation and Discrepancy 


As we remarked in Section 2.7.3, although the optimal solution to an inference
problem under the above conditions is to state one’s actual beliefs, there may
be technical reasons why the computation of this “optimal” density $p_\omega(\cdot\,|\,D)$ is
difficult. In such cases, we may need to seek a tractable approximation, $q_\omega(\cdot\,|\,D)$,
say, which is in some sense “close” to $p_\omega(\cdot\,|\,D)$, but much easier to specify. As in
the previous discussion of this idea, we shall need to examine carefully this notion
of “closeness”. The next result generalises Proposition 2.30.


Proposition 3.14. (Expected loss in probability reporting). If preferences
are described by a logarithmic score function, the expected loss of utility
in reporting a probability density $q(\omega\,|\,D)$, rather than the density $p(\omega\,|\,D)$
representing actual beliefs, is given by

$$\delta(q_\omega(\cdot\,|\,D)\,|\,p_\omega(\cdot\,|\,D)) = A \int_\Omega p(\omega\,|\,D) \log \frac{p(\omega\,|\,D)}{q(\omega\,|\,D)}\,d\omega.$$

Moreover, $\delta(q_\omega(\cdot\,|\,D)\,|\,p_\omega(\cdot\,|\,D))$ is non-negative and is zero if, and only if,
$q_\omega(\cdot\,|\,D) = p_\omega(\cdot\,|\,D)$.
Proof. From Definition 3.19 the expected utility of reporting $q_\omega(\cdot\,|\,D)$ when
$p_\omega(\cdot\,|\,D)$ is the actual belief distribution is given by

$$\bar{u}(d_q) = \int_\Omega [A \log q(\omega\,|\,D) + B(\omega)]\,p(\omega\,|\,D)\,d\omega,$$

so that

$$\delta(q_\omega(\cdot\,|\,D)\,|\,p_\omega(\cdot\,|\,D)) = \bar{u}(d_p) - \bar{u}(d_q) = A \int_\Omega p(\omega\,|\,D) \log \frac{p(\omega\,|\,D)}{q(\omega\,|\,D)}\,d\omega.$$

The final condition follows either from the fact that $u$ is proper, so that $\bar{u}(d_p) \ge \bar{u}(d_q)$
with equality if and only if $q_\omega(\cdot\,|\,D) = p_\omega(\cdot\,|\,D)$; or directly from the fact
that for all $x > 0$, $\log x \le x - 1$ with equality if and only if $x = 1$ (cf. Proposition
2.30).


As in the finitistic discussion of Section 2.7, the above result suggests a natural, 
general measure of “lack of fit”, or discrepancy, between a distribution and an 
approximation, when preferences are described by a logarithmic score function. 
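
The identity between the expected loss and the logarithmic divergence can be verified directly for mass functions. The following sketch is illustrative only: the distributions, the constant $A = 2$ and the values chosen for $B(\omega)$ are hypothetical, and the function names are introduced here.

```python
import math

def expected_log_score(q, p, A=1.0, B=None):
    """Expected utility of reporting q under beliefs p for the score A log q_j + B_j."""
    B = B if B is not None else [0.0] * len(p)
    return sum(p[j] * (A * math.log(q[j]) + B[j]) for j in range(len(p)))

def discrepancy(q, p, A=1.0):
    """delta(q | p) = A * sum_j p_j log(p_j / q_j), for strictly positive p and q."""
    return A * sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))

p = [0.2, 0.5, 0.3]      # actual beliefs (hypothetical)
q = [0.3, 0.3, 0.4]      # reported approximation
B = [1.0, -2.0, 0.5]     # arbitrary B(w); it cancels in the difference

loss = expected_log_score(p, p, 2.0, B) - expected_log_score(q, p, 2.0, B)
print(loss, discrepancy(q, p, 2.0))
```

The two printed numbers agree, the discrepancy is strictly positive because `q` differs from `p`, and `discrepancy(p, p)` is zero.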


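Definition 3.20. (Discrepancy of an approximation).
The discrepancy between a strictly positive probability density $p_\omega(\cdot)$ and an
approximation $\hat{p}_\omega(\cdot)$, $\omega \in \Omega$, is defined by

$$\delta(\hat{p}_\omega(\cdot)\,|\,p_\omega(\cdot)) = \int_\Omega p(\omega) \log \frac{p(\omega)}{\hat{p}(\omega)}\,d\omega.$$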


Example 3.4. (General normal approximations). Suppose that $p(\omega) > 0$, $\omega \in \Re$, is
an arbitrary density on the real line, with finite first two moments given by

$$\int_{-\infty}^{\infty} \omega\,p(\omega)\,d\omega = m, \qquad \int_{-\infty}^{\infty} (\omega - m)^2\,p(\omega)\,d\omega = t^{-1},$$

and that we wish to approximate $p(\cdot)$ by a $\hat{p}(\cdot)$ corresponding to a normal density, $\mathrm{N}(\omega\,|\,\mu, \lambda)$,
with labelling parameters $\mu, \lambda$ chosen to minimise the discrepancy measure given in
Definition 3.20. It is easy to see that, subject to the given constraints, minimising $\delta(\hat{p}\,|\,p)$ with
respect to $\mu$ and $\lambda$ is equivalent to minimising

$$-\int_{-\infty}^{\infty} p(\omega) \log \mathrm{N}(\omega\,|\,\mu, \lambda)\,d\omega,$$

and hence to minimising

$$-\frac{1}{2} \log \lambda + \frac{\lambda}{2} \int_{-\infty}^{\infty} p(\omega)(\omega - \mu)^2\,d\omega.$$

Invoking the two moment constraints, and writing $(\omega - \mu)^2 = (\omega - m + m - \mu)^2$, this
reduces to minimising, with respect to $\mu$ and $\lambda$, the expression

$$-\log \lambda + \frac{\lambda}{t} + \lambda(m - \mu)^2.$$

It follows that the optimal choice is $\mu = m$, $\lambda = t$. In other words, for the reporting
problem with a logarithmic score function, the best normal approximation to a distribution
on the real line (whose mean and variance exist) is the normal distribution having the same
mean and variance.
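
The final minimisation depends on $p(\cdot)$ only through $m$ and $t$, so it can be checked with a simple grid search. The sketch below assumes hypothetical values $m = 1.5$ and $t = 4$ and an arbitrary grid; the function name `objective` is introduced for the illustration.

```python
import math

def objective(mu, lam, m, t):
    """-log(lam) + lam/t + lam*(m - mu)**2, the expression minimised in the example."""
    return -math.log(lam) + lam / t + lam * (m - mu) ** 2

m, t = 1.5, 4.0      # hypothetical mean and precision of p
mus = [m + 0.1 * k for k in range(-10, 11)]
lams = [t * (1 + 0.05 * k) for k in range(-10, 11)]

best = min(((mu, lam) for mu in mus for lam in lams),
           key=lambda pair: objective(pair[0], pair[1], m, t))
print(best)   # (1.5, 4.0)
```

The grid minimum sits at $\mu = m$, $\lambda = t$, matching the moment-matching conclusion; any departure in $\mu$ adds the non-negative term $\lambda(m - \mu)^2$, and $-\log\lambda + \lambda/t$ is minimised at $\lambda = t$.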


Example 3.5. (Normal approximations to Student distributions). Suppose that we
wish to approximate the density $\mathrm{St}(x\,|\,\mu, \lambda, \alpha)$, $\alpha > 2$, by a normal density. We have just
shown that the best normal approximation to any distribution is that with the same first two
moments (assuming the latter to exist, corresponding here to the restriction $\alpha > 2$). Thus,
recalling from Section 3.2.2 that the mean and precision of $\mathrm{St}(x\,|\,\mu, \lambda, \alpha)$ are given by $\mu$ and
$\lambda(\alpha - 2)/\alpha$, respectively, it follows that the best normal approximation to $\mathrm{St}(x\,|\,\mu, \lambda, \alpha)$ is
provided by $\mathrm{N}(x\,|\,\mu, \lambda(\alpha - 2)/\alpha)$. From Definition 3.20, the corresponding discrepancy
will be

$$\delta(\mathrm{N}\,|\,\mathrm{St}) = \int_{-\infty}^{\infty} \mathrm{St}(x\,|\,\mu, \lambda, \alpha) \log \frac{\mathrm{St}(x\,|\,\mu, \lambda, \alpha)}{\mathrm{N}(x\,|\,\mu, \lambda(\alpha - 2)/\alpha)}\,dx.$$


Figure 3.3 Discrepancy between Student and normal densities


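This is easily evaluated (see, for example, Bernardo, 1978a) using the fact that the entropy
of a Student distribution is given by

$$H\{\mathrm{St}(x\,|\,\mu, \lambda, \alpha)\} = -\int \mathrm{St}(x\,|\,\mu, \lambda, \alpha) \log \mathrm{St}(x\,|\,\mu, \lambda, \alpha)\,dx$$

$$= -\log \frac{\Gamma((\alpha + 1)/2)}{\Gamma(\alpha/2)\,\Gamma(1/2)} - \frac{1}{2} \log \frac{\lambda}{\alpha} + \frac{\alpha + 1}{2} \left[ \psi\!\left(\frac{\alpha + 1}{2}\right) - \psi\!\left(\frac{\alpha}{2}\right) \right],$$

where $\psi(z) = \Gamma'(z)/\Gamma(z)$ denotes the digamma function (see, for example, Abramowitz
and Stegun, 1964), from which it follows that $\delta(\mathrm{N}\,|\,\mathrm{St})$ may be written as

$$\log \frac{\Gamma((\alpha + 1)/2)}{\Gamma(\alpha/2)} + \left(\frac{\alpha + 1}{2}\right) \left[ \psi\!\left(\frac{\alpha}{2}\right) - \psi\!\left(\frac{\alpha + 1}{2}\right) \right] + \frac{1}{2} \left[ 1 - \log\left(\frac{\alpha}{2} - 1\right) \right],$$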


which only depends on the degrees of freedom, $\alpha$, of the Student distribution. Figure 3.3
shows a plot of $\delta(\mathrm{N}\,|\,\mathrm{St})$ against $\alpha$.
Using Stirling’s approximation,

$$\log \Gamma(z) \approx \left(z - \tfrac{1}{2}\right) \log z - z + \tfrac{1}{2} \log(2\pi),$$

we obtain, for moderate to large values of $\alpha$,

$$\delta(\mathrm{N}\,|\,\mathrm{St}) \approx [\alpha(\alpha - 2)]^{-1} = O(1/\alpha^2),$$

so that $[\alpha(\alpha - 2)]^{-1}$ provides a simple, approximate measure of the departure from normality
of a Student distribution.
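
The closed form for $\delta(\mathrm{N}\,|\,\mathrm{St})$ can be cross-checked against direct numerical integration. The sketch below is a rough illustration under stated assumptions: it fixes $\mu = 0$ and $\lambda = 1$ (the discrepancy does not depend on them), uses a hand-written `digamma` based on the standard recurrence and asymptotic series because the Python standard library does not provide one, and integrates with a plain trapezoidal rule on a truncated range. All function names are introduced here.

```python
import math

def digamma(x):
    """psi(x) via psi(x) = psi(x + 1) - 1/x and the asymptotic series for large x."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def discrepancy_closed_form(alpha):
    """Closed-form delta(N | St) as a function of the degrees of freedom alpha > 2."""
    a = alpha / 2.0
    return (math.lgamma(a + 0.5) - math.lgamma(a)
            + (a + 0.5) * (digamma(a) - digamma(a + 0.5))
            + 0.5 * (1.0 - math.log(a - 1.0)))

def student_logpdf(x, alpha):
    """Log density of St(x | 0, 1, alpha)."""
    return (math.lgamma((alpha + 1) / 2) - math.lgamma(alpha / 2)
            - 0.5 * math.log(alpha * math.pi)
            - 0.5 * (alpha + 1) * math.log1p(x * x / alpha))

def normal_logpdf(x, precision):
    """Log density of N(x | 0, precision)."""
    return 0.5 * math.log(precision / (2 * math.pi)) - 0.5 * precision * x * x

def discrepancy_quadrature(alpha, half_width=60.0, step=0.01):
    """Trapezoidal approximation to the integral of St * log(St / N)."""
    precision = (alpha - 2.0) / alpha
    n = int(round(2 * half_width / step))
    total = 0.0
    for k in range(n + 1):
        x = -half_width + k * step
        lp = student_logpdf(x, alpha)
        f = math.exp(lp) * (lp - normal_logpdf(x, precision))
        total += f if 0 < k < n else 0.5 * f
    return total * step

print(discrepancy_closed_form(10.0), discrepancy_quadrature(10.0))
for alpha in (3.0, 4.0, 10.0, 30.0):
    delta = discrepancy_closed_form(alpha)
    print(alpha, delta, alpha * (alpha - 2) * delta)
```

In this sketch the two evaluations at $\alpha = 10$ agree closely, the discrepancy decreases as $\alpha$ grows, and the product $\alpha(\alpha - 2)\,\delta(\mathrm{N}\,|\,\mathrm{St})$ stays of order one, consistent with the $O(1/\alpha^2)$ rate. The truncation range is adequate for $\alpha = 10$; heavier tails (smaller $\alpha$) would need a wider range.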


3.4.4 Generalised Information 


In Section 2.7.4, we examined, in the finitistic context, the increase in expected 
utility provided by given data D. We now extend this analysis to the general 
setting, writing x to denote observed data D. 


Proposition 3.15. (Expected utility of data). If preferences are described by
a logarithmic score function for the class of probability densities $p(\omega\,|\,x)$
defined on $\Omega$, then the expected increase in utility provided by data $x$, when
the prior probability density is $p(\omega)$, is given by

$$A \int_\Omega p(\omega\,|\,x) \log \frac{p(\omega\,|\,x)}{p(\omega)}\,d\omega,$$

where $p(\omega\,|\,x)$ is the density of the posterior distribution for $\omega$, given $x$. This
expected increase in utility is non-negative, and zero if, and only if, $p(\omega\,|\,x)$
is identical to $p(\omega)$.


Proof. Using Definition 3.19, the expected increase in utility provided by $x$
is given by

$$\int_\Omega \{[A \log p(\omega\,|\,x) + B(\omega)] - [A \log p(\omega) + B(\omega)]\}\,p(\omega\,|\,x)\,d\omega = A \int_\Omega p(\omega\,|\,x) \log \frac{p(\omega\,|\,x)}{p(\omega)}\,d\omega,$$

which, by Proposition 3.14, is non-negative with equality if and only if $p(\omega\,|\,x)$
and $p(\omega)$ are identical.
The following natural definition of the amount of information provided by the
data extends that given in Definition 2.26.


Definition 3.21. (Information from data). The amount of information about
$\omega \in \Omega$ provided by data $x$ when the prior density is $p(\omega)$ is given by

$$I\{x\,|\,p_\omega(\cdot)\} = \int_\Omega p(\omega\,|\,x) \log \frac{p(\omega\,|\,x)}{p(\omega)}\,d\omega,$$

where $p(\omega\,|\,x)$ is the corresponding posterior density.


As in the finite case, it is interesting to note that the amount of information
provided by $x$ is equivalent to the discrepancy measure if the prior is considered as
an approximation to the posterior. Alternatively, we see that $\log p(\omega)$, $\log p(\omega\,|\,x)$,
respectively, measure how “good”, on a logarithmic scale, the prior and posterior
are at “predicting” the “true state of nature” $\omega$, so that $\log p(\omega\,|\,x) - \log p(\omega)$
is a measure of the usefulness of $x$ were $\omega$ known to be the true value. Thus
$I\{x\,|\,p_\omega(\cdot)\}$ is simply the expected value of that utility difference with respect to
the posterior density, given $x$.


The functional $\int p(\omega) \log p(\omega)\,d\omega$ has been used (see e.g., Lindley, 1956, and
references therein) as a measure of the ‘absolute’ information about $\omega$ contained
in the probability density $p(\omega)$. The increase in utility from observing $x$ is then

$$\int p(\omega\,|\,x) \log p(\omega\,|\,x)\,d\omega - \int p(\omega) \log p(\omega)\,d\omega$$

instead of our Definition 3.21. However, this expression is not invariant under
one-to-one transformations of $\omega$, a property which seems to us to be essential.
Note, however, that both expressions have the same expectation with respect
to the distribution of $x$. Draper and Guttman (1969) put forward yet another
non-invariant definition of information.


Additional references on statistical information concepts are Renyi (1964,
1966, 1967), Goel and DeGroot (1979) and De Waal and Groenewald (1989).

More generally, we may wish to step back to the situation before data become
available, and consider the idea of the amount of information to be expected from
an experiment $e$. We therefore generalise Definition 2.27.


Definition 3.22. (Expected information from an experiment). The expected
information to be provided by an experiment $e$ about $\omega \in \Omega$, when the prior
density is $p(\omega)$, is given by

$$I\{e\,|\,p_\omega(\cdot)\} = \int_X I\{x\,|\,p_\omega(\cdot)\}\,p(x\,|\,e)\,dx,$$

where the distribution of the possible data outcomes $x \in X$ resulting from the
experiment $e$ is described by $p(x\,|\,e)$.
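The following result, which is a generalisation of Proposition 2.32, provides
an alternative expression for $I\{e\,|\,p_\omega(\cdot)\}$.


Proposition 3.16. An alternative expression for the expected information is

$$I\{e\,|\,p_\omega(\cdot)\} = \int_X \int_\Omega p(\omega, x\,|\,e) \log \frac{p(\omega, x\,|\,e)}{p(\omega)\,p(x\,|\,e)}\,d\omega\,dx,$$

where $p(\omega, x\,|\,e) = p(\omega\,|\,x, e)\,p(x\,|\,e)$ and $p(\omega\,|\,x, e)$ is the posterior density
for $\omega$ given data $x$ and prior density $p(\omega)$. Moreover, $I\{e\,|\,p_\omega(\cdot)\} \ge 0$, with
equality if and only if $x$ and $\omega$ are independent random quantities, so that
$p(\omega, x\,|\,e) = p(\omega)\,p(x\,|\,e)$ for all $\omega$ and $x$.


Proof.

$$I\{e\,|\,p_\omega(\cdot)\} = \int_X \Big\{ \int_\Omega p(\omega\,|\,x, e) \log \frac{p(\omega\,|\,x, e)}{p(\omega)}\,d\omega \Big\}\,p(x\,|\,e)\,dx$$

$$= \int_X \int_\Omega p(\omega\,|\,x, e)\,p(x\,|\,e) \log \frac{p(\omega\,|\,x, e)\,p(x\,|\,e)}{p(\omega)\,p(x\,|\,e)}\,d\omega\,dx,$$

and the result now follows from the fact that $p(\omega, x\,|\,e) = p(\omega\,|\,x, e)\,p(x\,|\,e)$.
Moreover, since, by Proposition 3.14, $I\{x\,|\,p_\omega(\cdot)\} \ge 0$ with equality if and only
if $p(\omega\,|\,x, e) = p(\omega)$, it follows from Definition 3.22 that $I\{e\,|\,p_\omega(\cdot)\} \ge 0$ with
equality if and only if, for all $\omega$ and $x$, $p(\omega, x\,|\,e) = p(\omega)\,p(x\,|\,e)$.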
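
The equivalence of the two expressions can be checked in a finite setting, where the integrals become sums. The following sketch assumes a hypothetical two-state prior and a likelihood table `likelihood[w][x]` standing for $p(x\,|\,\omega, e)$; the function names are illustrative.

```python
import math

def information_from_data(prior, likelihood, x):
    """I{x | p} = sum_w p(w|x) log[p(w|x) / p(w)], returned together with p(x | e)."""
    joint = [prior[w] * likelihood[w][x] for w in range(len(prior))]
    p_x = sum(joint)
    posterior = [j / p_x for j in joint]
    info = sum(po * math.log(po / pr) for po, pr in zip(posterior, prior) if po > 0)
    return info, p_x

def expected_information(prior, likelihood):
    """Average of I{x | p} over the predictive distribution p(x | e)."""
    total = 0.0
    for x in range(len(likelihood[0])):
        info, p_x = information_from_data(prior, likelihood, x)
        total += p_x * info
    return total

def joint_form(prior, likelihood):
    """Sum over (w, x) of p(w, x) log[p(w, x) / (p(w) p(x))]."""
    n_w, n_x = len(prior), len(likelihood[0])
    p_x = [sum(prior[w] * likelihood[w][x] for w in range(n_w)) for x in range(n_x)]
    total = 0.0
    for w in range(n_w):
        for x in range(n_x):
            joint = prior[w] * likelihood[w][x]
            if joint > 0:
                total += joint * math.log(joint / (prior[w] * p_x[x]))
    return total

prior = [0.3, 0.7]
informative = [[0.8, 0.2], [0.1, 0.9]]      # data depend on the state
uninformative = [[0.4, 0.6], [0.4, 0.6]]    # data independent of the state

print(expected_information(prior, informative), joint_form(prior, informative))
print(expected_information(prior, uninformative))
```

The first line prints two matching positive values; the second is zero up to rounding, as expected when $x$ and $\omega$ are independent.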


Maximisation of the expected Shannon information was proposed by Lind- 
ley (1956) as a “reasonable” ad hoc criterion for choosing among alternative ex- 
periments. Fedorov (1972) proved later that certain classical design criteria (in 
particular, D-optimality) are special cases of this when normal distributions are 
assumed. We have shown that maximising expected information is just a particular 
(albeit important) case of the general criterion, implied by quantitative coherence, 
of maximising the expected utility in the case of pure inference problems. See 
Polson (1992) for a closely related argument. 


It follows from Proposition 2.31 and the last remark, that someone who adopts
the classical D-optimality criterion of optimal design under standard normality
assumptions should, for consistency, have preferences which are described by a
logarithmic scoring rule; otherwise, such designs are not optimal with respect to
his or her underlying preferences.


There is a considerable literature on the Bayesian design of experiments, which
we will not attempt to review here. A detailed discussion will be given in the volume
Bayesian Methods. We note that important references include Blackwell (1951,
1953), Lindley (1956), Chernoff (1959), Stone (1959), DeGroot (1962, 1970),
Duncan and DeGroot (1976), Bandemer (1977), Smith and Verdinelli (1980),
Pilz (1983/1991), Chaloner (1984), Sugden (1985), Mazloum and Meeden (1987),
Felsenstein (1988, 1992), DasGupta and Studden (1991), El-Krunz and Studden
(1991), Pardo et al. (1991), Mitchell and Morris (1992), Pham-Gia and Turkkan
(1992), Verdinelli (1992), Verdinelli and Kadane (1992), Lindley and Deely (1993),
Lad and Deely (1994) and Parmigiani and Berry (1994).


3.5 DISCUSSION AND FURTHER REFERENCES 
3.5.1 The Role of Mathematics 


The translation of any substantive theory into a precise mathematical formalism 
necessarily involves an element of idealisation. 

We have already had occasion to remark on aspects of this problem in Chap- 
ter 2, in the context of using real numbers rather than subsets of the rationals to 
represent actual measurements (necessarily “finitised” by inherent accuracy limits 
of the uncertainty apparatus). Similar remarks are obviously called for in the
context of using, for example, probability densities to represent belief distributions for
real-valued observables.

In some situations, as we shall see in Chapter 4, the adoption of specific
forms of density may follow from simple, structural assumptions about the form
of the belief distribution. In other situations, however, if we really try to think of
such a density as being practically identified by expressions of preference among,
say, standard options, we would encounter the obvious operational problem that,
implicitly, an infinite number of revealed preferences would be required.

Clearly, in such situations the precise mathematical form of a density is likely 
to have arisen as an approximation to a “rough shape” obtained from some finite 
elicitation or observation process, and has been chosen, arbitrarily, for reasons
of mathematical convenience, from an available mathematical tool-kit. Similar
remarks apply to the choice, for descriptive or mathematical convenience, of infinite 
sets to represent consequences or decisions, with the attendant problems of defining 
appropriate concepts of expected utility. 

There are obvious dangers, therefore, in accepting too uncritically any
orientation, or would-be insightful mathematical analysis, that flows from arbitrary,
idealised mathematical inputs into the general quantitative coherence theory.
However, given an awareness of the dangers involved, we can still systematically make
use of the power and elegance of the (idealised) mathematics by simultaneously
asserting, as a central tenet of our approach, a concern with the robustness and
sensitivity of the output of an analysis to the form of input assumed (see Section 5.6.3).
Of course, we shall later have to make precise the sense in which these terms are
to be interpreted and the actual forms of procedures to be adopted. That being
understood, our approach, as with the earlier formalism of Chapter 2, will be to
work with the mathematical idealisation, in order to exploit its potential power and
insight, while constantly bearing in mind the need for a large pinch of salt and a
repertoire of sensitivity diagnostics.


3.5.2 Critical Issues 


We shall comment further on four aspects of the general mathematical structure
we have developed and will be using throughout the remainder of this volume.
These will be dealt with under the following subheadings: (i) Finite versus
Countable Additivity; (ii) Measure versus Linear Theory; (iii) Proper versus Improper
Probabilities; (iv) Abstract versus Concrete Mathematics.


Finite versus Countable Additivity 


In Chapter 2, we developed, from a directly intuitive and operational perspective, 
a minimal mathematical framework for a theory of quantitative coherence. The 
role of the mathematics employed in this development was simply that of a tool to 
capture the essentials of the substantive concepts and theory; within the resulting 
finitistic framework we then established that uncertainties should be represented in 
terms of finitely additive probabilities. 

The generalisations and extensions of the theory given in the present chapter 
lead, instead, to the mathematical framework of countable additivity, within which 
we have available the full panoply of analytic tools from mathematical probability 
theory. The latter is clearly highly desirable from the point of view of mathematical 
convenience, but it is important to pause and consider whether the development of 
a more convenient mathematical framework has been achieved at the expense of a 
distortion of the basic concepts and ideas. 

First, let us emphasise that, from a philosophical perspective, the monotone 
continuity postulate introduced in Section 3.1.2 does not have the fundamental 
status of the axioms presented in Chapter 2. We regard the latter as encapsulating 
the essence of what is required for a theory of quantitative coherence. The former 
is an “optional extra” assumption that one might be comfortable with in specific 
contexts, but should in no way be obliged to accept as a prerequisite for quantitative 
coherence. 

Secondly, we note that the effect of accepting that preferences should conform 
to the monotone continuity postulate is to restrict one’s available (in the sense of 
coherent) belief specifications to a subset of the finitely additive uncertainty
measures; namely, those that are also countably additive. This is, of course, potentially
disturbing from a subjectivist perspective, since a key feature of the theory is that
the only constraints on belief specifications should be that they are coherent. For
some such representations to be ruled out a priori, as a consequence of a postulate
adopted purely for mathematical convenience, would indeed be a distortion of the
theory. This is why we regard such a postulate as different from the basic axioms.
However, provided one is aware of, and not concerned about, the implicit restriction
of the available belief representations, its adoption may be very natural in contexts
where one is, in any case, prepared to work in an extended mathematical setting.

Throughout this work, we shall, in fact, make systematic use of concepts
and tools from mathematical probability theory, without further concern or debate
about this issue. However, to underline what we already said in Section 3.1.2, it is
important to be on guard and to be aware that distortions might occur. To this end,
we draw attention to some key references to which the reader may wish to refer in
order to heighten such awareness and to study in detail the issues involved. 

De Finetti (1970/1974, pp. 116-133, 173-177 and 228-241; 1970/1975,
pp. 267-276 and 340-361) provides a wealth of detailed analysis, illustration and
comment on the issues surrounding finite versus countable (and other) additivity 
assumptions, his own analysis being motivated throughout by the guiding principle 
that 


... mathematics is an instrument which should conform itself strictly to the
exigencies of the field in which it is to be applied. (1970/1974, p. 3)


Further technical and philosophical discussion is given in de Finetti (1972, Chapters
5 and 6); see, also, Stone (1986). Systematic use of finite additivity in
decision-related contexts is exemplified in Dubins and Savage (1965/1976), Heath and
Sudderth (1978, 1989), Stone (1979b), Hill (1980), Sudderth (1980), Seidenfeld and
Schervish (1983), Hill and Lane (1984), Regazzini (1987) and Regazzini and Petris
(1993). A discussion of the statistical implications of finitely additive probability
is given by Kadane et al. (1986).

In Section 2.8.3 we discussed, within a finitistic framework, several “betting”
approaches to establishing probability as the only coherent measure of degree of
belief. These ideas may be extended to the general case. Dawid and Stone (1972,
1973) introduce the concept of “expectation consistency”, and show the necessity
of using Bayes’ theorem to construct probability distributions corresponding to fair 
bets made with additional information. Other generalised discussions on coherence 
of inference in terms of gambling systems include Lane and Sudderth (1983) and 
Brunk (1991). 


Measure versus Linear Theory 


Mathematical probability theory can be developed, equivalently, starting either from
the usual Kolmogorov axioms for a set function defined over a σ-field of events
(see, for example, Ash, 1972), or from axioms for a linear operator defined over
a linear space of random quantities (see, for example, Whittle, 1976). The former
deals directly with probability measure, the latter with an expectation operator (or
a prevision, in de Finetti’s terminology).

In our development of a quantitative coherence theory, the axiomatic approach
to preferences among options has led us more naturally towards probability mea- 
sures as the primary probabilistic element, with expectation (prevision) defined sub- 
sequently. In the approach to coherence put forward in de Finetti (1972, 1970/1974, 
1970/1975), prevision is the primary element, with probability subsequently
emerging as a special case for 0-1 random quantities. The case for adopting the linear
rather than the measure theory approach is argued at length by de Finetti, there 
being many points of contact with the argument regarding finite versus countable 
additivity, particularly the need to avoid, in the mathematical formulation, going 
beyond those aspects required for the problem in hand. In the specific context of 
statistical modelling and inference, Goldstein (1981, 1986a, 1986b, 1987a, 1987b, 
1988, 1991, 1994) has systematically developed the linear approach advocated by 
de Finetti, showing that a version of a subjectivist programme for revising beliefs 
in the light of data can be implemented without recourse to the full probabilistic 
machinery developed in this chapter. Lad et al. (1990) provide further discussion
on the concept of prevision. 

We view these and related developments with great interest and with no dog- 
matic opinion concerning the ultimate relative usefulness and acceptance of “lin- 
ear” versus “probabilistic” Bayesian statistical concepts and methods. That said, 
the present volume is motivated by our conviction that, currently, there remains a 
need for a detailed exposition of the Bayesian approach within the, more or less, 
conventional framework of full probabilistic descriptions. 


Proper versus Improper Probabilities 


Whether viewed in terms of finite or countable additivity, we have taken probability
to be a measure with values in the interval [0, 1]. However, it is possible to adopt
axiomatic approaches which allow for infinite (or improper) probabilities: see, for
example, Renyi (1955, 1962/1970, Chapter 2, and references therein), who uses
conditional arguments to derive proper probabilities from improper distributions,
and Hartigan (1983, Chapter 3), who directly provides an axiomatic foundation for 
improper or, as he terms them, non-unitary, probabilities. We shall not review such 
axiomatic theories in detail, but note that we shall encounter improper distributions 
systematically in Section 5.4. 


Abstract versus Concrete Mathematics 


When probabilistic mathematics is being used as a tool for the representation and 
analysis of substantive non-mathematical problems, rather than as a direct mathe- 
matical concern in its own right, there is always a dilemma regarding the appropriate 
level of mathematics to be used. Specifically, there are basic decisions to be made 
about how much measure-theoretical machinery should be invoked. The introduc- 
tion of too much abstract mathematics can easily make the substantive content seem 
totally opaque to the very reader at whom it is most aimed. On the other hand, too
little machinery may prove inadequate to provide a complete mathematical
treatment, requiring the omission of certain topics, or the provision of just a partial,
non-rigorous treatment, with insight and illustration attempted only by concrete 
examples. 

Thus far, we have tried to provide a complete, rigorous treatment of the Foun- 
dations and Generalisations of the theory of quantitative coherence, within the 
mathematical framework of Chapters 2 and 3. This chapter essentially defines the 
upper limit of mathematical machinery we shall be using and, in fact, most of our 
subsequent development will be much more straightforward. However, it will be 
the case, for example in Chapter 4, that some results of interest to us require rather 
more sophisticated mathematical tools than we have made available. Our response 
to this problem will be to try to make it clear to the reader when this is the case,
and to provide references to a complete treatment of such results, together with 
(hopefully) sufficient concrete discussion and illustration to illuminate the topic. 

For more sophisticated mathematical treatments of Bayesian theory, the reader
is referred to Hartigan (1983) and Florens et al. (1990).


Chapter 4


Modelling 


Summary 


The relationship between beliefs about observable random quantities and their 
representation using conventional forms of statistical models is investigated. It 
is shown that judgements of exchangeability lead to representations that justify 
and clarify the use and interpretation of such familiar concepts as parameters, 
random samples, likelihoods and prior distributions. Beliefs which have certain 
additional invariance properties are shown to lead to representations involving 
familiar specific forms of parametric distributions, such as normals and expo- 
nentials. The concept of a sufficient statistic is introduced and related to rep- 
resentations involving the exponential family of distributions. Various forms of 
partial exchangeability judgements about data structures involving several sam- 
ples, structured layouts, covariates and designed experiments are investigated, 
and links established with a number of other commonly used statistical models. 


4.1 STATISTICAL MODELS
4.1.1 Beliefs and Models 


The subjectivist, operationalist viewpoint has led us to the conclusion that, if we 
aspire to quantitative coherence, individual degrees of belief, expressed as
probabilities, are inescapably the starting point for descriptions of uncertainty. There can
be no theories without theoreticians; no learning without learners; in general, no
science without scientists. It follows that learning processes, whatever their
particular concerns and fashions at any given point in time, are necessarily reasoning
processes which take place in the minds of individuals. To be sure, the object of
attention and interest may well be an assumed external, objective reality; but the
actuality of the learning process consists in the evolution of individual, subjective
beliefs about that reality. However, it is important to emphasise, as in our earlier
discussion in Section 2.8, that the primitive and fundamental notions of individual
preference and belief will typically provide the starting point for interpersonal
communication and reporting processes. In what follows, both here, and more
particularly in Chapter 5, we shall therefore often be concerned to identify and
examine features of the individual learning process which relate to interpersonal
issues, such as the conditions under which an approximate consensus of beliefs
might occur in a population of individuals. 

In Chapters 2 and 3, we established a very general foundational framework for 
the study of degrees of belief and their evolution in the light of new information. We
now turn to the detailed development of these ideas for the broad class of problems
of primary interest to statisticians: namely, those where the events of interest are
defined explicitly in terms of random quantities, $x_1, \ldots, x_n$ (discrete or continuous,
and possibly vector-valued) representing observed or experimental data. 

In such cases, we shall assume that an individual's degrees of belief for
events of interest are derived from the specification of a joint distribution
function $P(x_1, \ldots, x_n)$, which we shall typically assume, without systematic reference
to measure-theoretic niceties, to be representable in terms of a joint density function
$p(x_1, \ldots, x_n)$ (to be understood as a mass function in the discrete case).

Of course, any such specification implicitly defines a number of other degrees
of belief specifications of possible interest: for example, for $1 \le m < n$,

$$p(x_1, \ldots, x_m) = \int p(x_1, \ldots, x_n)\, dx_{m+1} \cdots dx_n$$

provides the marginal joint density for $x_1, \ldots, x_m$, and

$$p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m) = \frac{p(x_1, \ldots, x_n)}{p(x_1, \ldots, x_m)}$$

gives the joint density for the as yet unobserved $x_{m+1}, \ldots, x_n$, conditional on
having observed $x_1, \ldots, x_m$. Within the Bayesian framework, this
latter conditional form is the key to “learning from experience”.
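
The two operations above can be illustrated numerically in the discrete case, where the integral becomes a sum. The following is a minimal sketch only: the joint mass function `joint` (an equal mixture of two independent coin models) and the helper names `marginal` and `conditional` are hypothetical choices made for illustration, not part of the text.

```python
from itertools import product

def marginal(joint, m):
    """Sum a joint mass function over all but the first m coordinates."""
    out = {}
    for xs, prob in joint.items():
        out[xs[:m]] = out.get(xs[:m], 0.0) + prob
    return out

def conditional(joint, observed):
    """Mass function of the unobserved tail given the observed head,
    computed as the ratio joint / marginal."""
    observed = tuple(observed)
    m = len(observed)
    denom = marginal(joint, m).get(observed, 0.0)
    if denom == 0.0:
        raise ValueError("conditioning event has probability zero")
    return {xs[m:]: prob / denom
            for xs, prob in joint.items() if xs[:m] == observed}

# Hypothetical joint mass function for three 0-1 quantities: an equal
# mixture of independent sequences with chance 0.2 and chance 0.8 of a 1.
joint = {
    xs: 0.5 * 0.2 ** sum(xs) * 0.8 ** (3 - sum(xs))
        + 0.5 * 0.8 ** sum(xs) * 0.2 ** (3 - sum(xs))
    for xs in product((0, 1), repeat=3)
}

# Beliefs about x_3 after observing x_1 = 1 and x_2 = 1.
predictive = conditional(joint, (1, 1))
```

Under these assumed beliefs the unconditional probability that $x_3 = 1$ is 1/2, whereas `predictive[(1,)]` is larger, so the observed data have changed the beliefs about the unobserved quantity.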


We recall that, throughout, we shall use notation such as P and p in a generic sense,
rather than as specifying particular functions. In particular, P may sometimes
refer to an underlying probability measure, and sometimes refer to implied
distribution functions, such as $P(x_1)$, $P(x_1, \ldots, x_n)$ or $P(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m)$.
Similarly, we may write $p(x_1)$, $p(x_1, \ldots, x_n)$, etc., so that, for example,

$$p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m) = p(x_1, \ldots, x_n) / p(x_1, \ldots, x_m)$$

simply indicates that the conditional density for $x_{m+1}, \ldots, x_n$ given $x_1, \ldots, x_m$
is given by the ratio of the specified joint densities. Such usage avoids notational 
proliferation, and the context will always ensure that there is no confusion of 
meaning. 


Thus far, however, our discussion is rather “abstract”. In actual applications 
we shall need to choose specific, concrete forms for joint distributions. This is 
clearly a somewhat daunting task, since direct contemplation and synthesis of the 
many complex marginal and conditional judgements implicit in such a specification 
are almost certainly beyond our capacity in all but very simple situations. We shall 
therefore need to examine rather closely this process of choosing a specific form 
of probability measure to represent degrees of belief. 


Definition 4.1. (Predictive probability model). A predictive model for a
sequence of random quantities $x_1, x_2, \ldots$ is a probability measure P, which
mathematically specifies the form of the joint belief distribution for any subset
of $x_1, x_2, \ldots$.


In some cases, we shall find that we are able to identify general types of 
belief structure which “pin down”, in some sense, the mathematical representation 
strategy to be adopted. In other cases, this “formal” approach will not take us very 
far towards solving the representation problem and we shall have to fall back on 
rather more pragmatic modelling strategies. 

At this stage, a word of warning is required. In much statistical writing, the 
starting point for formal analysis is the assumption of a mathematical model form, 
typically involving “unknown parameters”, the main object of the study being to 
infer something about the values of these parameters. From our perspective, this 
is all somewhat premature and mysterious! We are seeking to represent degrees 
of belief about observables: nothing in our previous development justifies or gives 
any insight into the choice of particular “models”, and thus far we have no way of 
attaching any operational meaning to the “parameters” which appear in conventional 
models. However, as we shall soon see, the subjectivist, operationalist approach 
will provide considerable insight into the nature and status of these conventional 
assumptions. 


4.2 EXCHANGEABILITY AND RELATED CONCEPTS 
4.2.1 Dependence and Independence


Consider a sequence of random quantities $x_1, x_2, \ldots$, and suppose that a predictive
model is assumed which specifies that, for all $n$, the joint density can be written in
the form

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i),$$

so that the $x_i$ are independent random quantities. It then follows straightforwardly
that, for any $1 \le m < n$,

$$p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m) = p(x_{m+1}, \ldots, x_n),$$

so that no learning from experience can take place within this sequence of
observations. In other words, past data provide us with no additional information about
the possible outcomes of future observations in the sequence.
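
A small numerical check makes the point concrete. This is a sketch under stated assumptions: the chance `theta = 0.3` and the function names are hypothetical, and the quantities are taken to be 0-1 valued purely for convenience.

```python
from itertools import product

theta = 0.3  # hypothetical fixed chance that any x_i equals 1

def p_indep(xs):
    """Joint mass under the independence model: a product of one-term masses."""
    prob = 1.0
    for x in xs:
        prob *= theta if x == 1 else 1.0 - theta
    return prob

def predictive_next_one(history):
    """p(x_{m+1} = 1 | x_1, ..., x_m), computed as a ratio of joint masses."""
    history = tuple(history)
    return p_indep(history + (1,)) / p_indep(history)

# Every possible history of length four leads to the same prediction.
histories = list(product((0, 1), repeat=4))
predictions = [predictive_next_one(h) for h in histories]
```

Each entry of `predictions` equals `theta` up to rounding, whatever was observed, which is exactly the absence of learning described above.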

A predictive model specifying such an independence structure is clearly inap- 
propriate in contexts where we believe that the successive accumulation of data will 
provide increasing information about future events. In such cases, the structure of 
the joint density $p(x_1, \ldots, x_n)$ must encapsulate some form of dependence among
the individual random quantities. In general, however, there are a vast number of
possible subjective assumptions about the form such dependencies might take and
there can be no all-embracing theoretical discussion. Instead, what we can do is
to concentrate on some particular simple forms of judgement about dependence
structures which might correspond to actual judgements of individuals in certain
situations. 

There is no suggestion that the structures we are going to discuss in subse- 
quent subsections have any special status, or ought to be adopted in most cases, or 
whatever. They simply represent forms of judgement which may often be felt to 
be appropriate and whose detailed analysis provides illuminating insight into the 
specification and interpretation of certain classes of predictive models. 


4.2.2 Exchangeability and Partial Exchangeability 


Suppose that, in thinking about $P(x_1, \ldots, x_n)$, his or her joint degree of belief
distribution for a sequence of random quantities $x_1, \ldots, x_n$, an individual makes
the judgement that the subscripts, the “labels” identifying the individual random
quantities, are “uninformative”, in the sense that he or she would specify all the
marginal distributions for the individual random quantities identically, and similarly
for all the marginal joint distributions for all possible pairs, triples, etc., of the
random quantities. It is easy to see that this implies that the form of the joint
distribution must be such that

$$P(x_1, \ldots, x_n) = P(x_{\pi(1)}, \ldots, x_{\pi(n)})$$

for any possible permutation $\pi$ of the subscripts $\{1, \ldots, n\}$. We formalise this
notion of “symmetry” of beliefs for the individual random quantities as follows.


Definition 4.2. (Finite exchangeability). The random quantities $x_1, \ldots, x_n$
are said to be judged (finitely) exchangeable under a probability measure P
if the implied joint degree of belief distribution satisfies

$$P(x_1, \ldots, x_n) = P(x_{\pi(1)}, \ldots, x_{\pi(n)})$$

for all permutations $\pi$ defined on the set $\{1, \ldots, n\}$. In terms of the
corresponding density or mass function, the condition reduces to

$$p(x_1, \ldots, x_n) = p(x_{\pi(1)}, \ldots, x_{\pi(n)}).$$
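
For 0-1 quantities and small n, the condition in Definition 4.2 can be checked by brute force. The sketch below is illustrative only: `is_exchangeable`, `symmetric_pmf` and `sticky_pmf` are hypothetical names, and the two mass functions are invented examples (one depending on the data only through the number of 1's, one with order-dependent "sticky" behaviour).

```python
from itertools import permutations, product
from math import comb

def is_exchangeable(pmf, n, tol=1e-12):
    """Check p(x_1..x_n) == p(x_pi(1)..x_pi(n)) for every 0-1 sequence
    and every permutation pi of the n positions."""
    for xs in product((0, 1), repeat=n):
        base = pmf(xs)
        for pi in permutations(range(n)):
            permuted = tuple(xs[i] for i in pi)
            if abs(base - pmf(permuted)) > tol:
                return False
    return True

def symmetric_pmf(xs):
    """Hypothetical mass function for n = 3 that depends on xs only through
    the number of ones; weights[k] is the probability of k ones in total."""
    weights = [0.1, 0.2, 0.3, 0.4]
    k = sum(xs)
    return weights[k] / comb(len(xs), k)

def sticky_pmf(xs):
    """Hypothetical order-dependent mass function: each outcome repeats the
    previous one with probability 0.9."""
    prob = 0.5
    for prev, cur in zip(xs, xs[1:]):
        prob *= 0.9 if cur == prev else 0.1
    return prob

symmetric_ok = is_exchangeable(symmetric_pmf, 3)
sticky_ok = is_exchangeable(sticky_pmf, 3)
```

The first mass function passes the check and the second fails, since under the sticky model the sequence (0, 0, 1) is much more probable than (0, 1, 0).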


Example 4.1. (Tossing a thumb tack). Consider a sequence of tosses of a standard
metal drawing pin (or thumb tack), and let $x_i = 1$ if the pin lands point uppermost on the
ith toss, $x_i = 0$ otherwise, $i = 1, \ldots, n$. If the tosses are performed in such a way that
time order appears to be irrelevant and the conditions of the toss appear to be essentially
held constant throughout, it would seem to be the case that, whatever precise quantitative
form their beliefs take, most observers would judge the outcomes of the sequence of tosses
$x_1, x_2, \ldots$ to be exchangeable in the above sense.


In general, the exchangeability assumption captures, for a subjectivist interested 
in belief distributions for observables, the essence of the idea of a so-called
“random sample”. This latter notion is, of course, of no direct use to us at this
stage, since it (implicitly) involves the idea of “conditional independence, given
the value of the underlying parameter”, a meaningless phrase thus far within our
framework. 


The notion of exchangeability involves a judgement of complete symmetry 
among all the observables $x_1, \ldots, x_n$ under consideration. Clearly, in many
situations this might be too restrictive an assumption, even though a partial judgement
of symmetry is present. 


Example 4.1. (cont.). Suppose that the tosses of a drawing pin are not
all made with the same pin, but that the even and odd numbered tosses are made with
different pins: an all metal one for the odd tosses; a plastic-coated one for the even tosses.
Alternatively, suppose that the same pin were used throughout, but that the odd tosses are 
made by a different person, using a completely different tossing mechanism from that used 
for the even tosses. In such cases, many individuals would retain an exchangeable form 
of belief distribution within the sequences of odd and even tosses separately, but might be 
reluctant to make a judgement of symmetry for the combined sequence of tosses.


Example 4.2. (Laboratory measurements). Suppose that $x_1, x_2, \ldots$ are real-valued
measurements of a physical or chemical property of a given substance, all made on the same
sample with the same measurement procedure. Under such conditions, many individuals
might judge the complete sequence of measurements to be exchangeable. 

Suppose, however, that sequences of such measurements are combined from s different 
laboratories, the substance being identical but the measurement procedures varying from 
laboratory to laboratory. In this case, judgements of exchangeability for each laboratory
sequence separately might be appropriate, whereas such a judgement for the combined 
sequence might not be. 


Example 4.3. (Physiological responses). Suppose that $x_1, x_2, \ldots$ are real-valued
measurements of a specific physiological response in human subjects when a particular
drug is administered. If the drug is administered at more than one dose level and if there
are both male and female subjects, spanning a wide age range, most individuals would be
very reluctant to make a judgement of exchangeability for the entire sequence of results.
However, within each combination of dose-level, sex and appropriately defined age-group,
a judgement of exchangeability might be regarded as reasonable. 


Judgements of the kind suggested in the above examples correspond to forms 
of partial exchangeability. Clearly, there are many possible forms of departure 
from overall judgements of exchangeability to those of partial exchangeability and
so a formal definition of the term does not seem appropriate. In general, it simply
signifies that there may be additional “labels” on the random quantities (for example,
odd and even, or the identification of the tossing mechanism in Example 4.1) with 
exchangeable judgements made separately for each group of random quantities 
having the same additional labels. A detailed discussion of various possible forms 
of partial exchangeability will be given in Section 4.6. 
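
The two-pin version of Example 4.1 can be mimicked numerically. The sketch below is one hypothetical way of encoding such beliefs, not a construction taken from the text: it assumes, purely for illustration, that beliefs about the odd and even tosses are independent of each other and that each group is a mixture over two candidate chances (`PRIOR_A` and `PRIOR_B`, both invented). It then checks which swaps of positions leave the joint mass function unchanged.

```python
from itertools import product

# Hypothetical beliefs: each pin has an unknown chance of landing point up,
# taking one of two values with equal weight.
PRIOR_A = {0.2: 0.5, 0.4: 0.5}  # all-metal pin, used on odd-numbered tosses
PRIOR_B = {0.6: 0.5, 0.9: 0.5}  # plastic-coated pin, used on even-numbered tosses

def mixture(xs, prior):
    """Mass of a 0-1 sequence under a mixture of independent tosses."""
    k, n = sum(xs), len(xs)
    return sum(w * t ** k * (1 - t) ** (n - k) for t, w in prior.items())

def pmf(xs):
    """Joint mass for the combined sequence (assumes the two pins are
    judged independently of one another)."""
    odd = xs[0::2]   # tosses 1, 3, ...
    even = xs[1::2]  # tosses 2, 4, ...
    return mixture(odd, PRIOR_A) * mixture(even, PRIOR_B)

def swap(xs, i, j):
    ys = list(xs)
    ys[i], ys[j] = ys[j], ys[i]
    return tuple(ys)

def invariant_under_swap(i, j, n=4, tol=1e-12):
    """True if exchanging positions i and j never changes the joint mass."""
    return all(abs(pmf(xs) - pmf(swap(xs, i, j))) <= tol
               for xs in product((0, 1), repeat=n))

within_odd = invariant_under_swap(0, 2)   # tosses 1 and 3
within_even = invariant_under_swap(1, 3)  # tosses 2 and 4
across = invariant_under_swap(0, 1)       # tosses 1 and 2
```

Under these assumptions the mass function is symmetric within the odd tosses and within the even tosses, but not across the two groups, which is the pattern the term partial exchangeability is meant to describe.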


We shall now return to the simple case of exchangeability and examine in detail 
the form of representation of $p(x_1, \ldots, x_n)$ which emerges in various special cases.
As a preliminary, we shall generalise our previous definition of exchangeability
to allow for “potentially infinite” sequences of random quantities. In practice, it
should, at least in principle, always be possible to give an upper bound to the number
of observables to be considered. However, specifying an actual upper bound may be
somewhat difficult or arbitrary and so, for mathematical and descriptive purposes, it
is convenient to be able to proceed as if we were contemplating an infinite sequence 
of potential observables. Of course, it will be important to establish that working 
within the infinite framework does not cause any fundamental conceptual distortion. 
These and related issues of finite versus infinite exchangeability will be considered 
in more detail in Section 4.7.1. For the time being, we shall concentrate on the
“potentially infinite” case.


Definition 4.3. (Infinite exchangeability). The infinite sequence of random
quantities $x_1, x_2, \ldots$ is said to be judged (infinitely) exchangeable if
every finite subsequence is judged exchangeable in the sense of Definition 4.2.


One might be tempted to wonder whether every finite sequence of exchange- 
able random quantities could be embedded in or extended to an infinitely exchange- 
able sequence of similarly defined random quantities. However, this is certainly 
not the case as the following example shows. 


Example 4.4. (Non-extendible exchangeability). Suppose that we define the three
random quantities $x_1, x_2, x_3$ such that either $x_i = 1$ or $x_i = 0$, $i = 1, 2, 3$, with joint
probability function given by

$$p(x_1 = 0, x_2 = 1, x_3 = 1) = p(x_1 = 1, x_2 = 0, x_3 = 1) = p(x_1 = 1, x_2 = 1, x_3 = 0) = 1/3,$$

with all other combinations of $x_1, x_2, x_3$ having probability zero, so that $x_1, x_2, x_3$ are clearly
exchangeable. We shall now try to identify an $x_4$, taking only values 0 and 1, such that
$x_1, \ldots, x_4$ are exchangeable. For this to be possible, we require, for example,

$$p(x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 0) = p(x_1 = 0, x_2 = 0, x_3 = 1, x_4 = 1).$$

But

$$p(x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 0) = p(x_1 = 0, x_2 = 1, x_3 = 1) - p(x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 1)$$
$$= 1/3 - p(x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 1)$$
$$= 1/3 - p(x_1 = 1, x_2 = 1, x_3 = 1, x_4 = 0),$$

where

$$p(x_1 = 1, x_2 = 1, x_3 = 1, x_4 = 0) \le p(x_1 = 1, x_2 = 1, x_3 = 1) = 0,$$

so that

$$p(x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 0) = 1/3.$$

However, we also have

$$p(x_1 = 0, x_2 = 0, x_3 = 1, x_4 = 1) \le p(x_1 = 0, x_2 = 0, x_3 = 1) = 0$$

and so

$$p(x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 0) \ne p(x_1 = 0, x_2 = 0, x_3 = 1, x_4 = 1).$$


It follows that a finitely exchangeable sequence cannot even necessarily be embedded
in a larger finitely exchangeable sequence, let alone an infinitely exchangeable sequence.
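
The same conclusion can be reached by a short calculation, sketched here with hypothetical helper names. It relies on a standard fact: if four 0-1 quantities are exchangeable then, given that their total is k, every arrangement of the k ones is equally likely, so the number of ones among the first three is hypergeometric. The probability that exactly two of the first three equal 1 is therefore a weighted average of the values computed below, whereas the distribution in Example 4.4 requires it to equal 1.

```python
from math import comb

def hypergeom(k, N, n, y):
    """P(y ones among the first n | k ones among all N), under exchangeability."""
    if y > k or n - y > N - k:
        return 0.0
    return comb(k, y) * comb(N - k, n - y) / comb(N, n)

# One value for each possible total k = 0, ..., 4 of the extended sequence.
attainable = [hypergeom(k, N=4, n=3, y=2) for k in range(5)]

# No weighting of these values can exceed their maximum.
upper_bound = max(attainable)
```

Here `upper_bound` is 0.75, attained when the total is 3, so no exchangeable extension to four quantities can give probability 1 to two ones among the first three.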


4.3 MODELS VIA EXCHANGEABILITY
4.3.1 The Bernoulli and Binomial Models 


We consider first the case of an infinitely exchangeable sequence of 0-1 random
quantities, $x_1, x_2, \ldots$, with $x_i = 0$ or $x_i = 1$, for all $i = 1, 2, \ldots$. Without loss
of generality, we shall derive a representation result for the joint mass function,
$p(x_1, \ldots, x_n)$, of the first $n$ random quantities $x_1, \ldots, x_n$.


Proposition 4.1. (Representation theorem for 0-1 random quantities).
If $x_1, x_2, \ldots$ is an infinitely exchangeable sequence of 0-1 random quantities
with probability measure P, there exists a distribution function Q such that
the joint mass function $p(x_1, \ldots, x_n)$ for $x_1, \ldots, x_n$ has the form

$$p(x_1, \ldots, x_n) = \int_0^1 \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i}\, dQ(\theta),$$

where

$$Q(\theta) = \lim_{n \to \infty} P[y_n / n \le \theta],$$

with $y_n = x_1 + \cdots + x_n$ and $\theta = \lim_{n \to \infty} y_n / n$.


Proof. (De Finetti, 1930, 1937/1964; here we follow closely the proof given
by Heath and Sudderth, 1976; see also Barlow, 1991). Suppose $x_1 + \cdots + x_n = y_n$;
then, by exchangeability, for any $0 \le y_n \le n$,

$$p(x_1 + \cdots + x_n = y_n) = \binom{n}{y_n} p(x_{\pi(1)}, \ldots, x_{\pi(n)})$$

for any permutation $\pi$ of $\{1, \ldots, n\}$ such that $x_{\pi(1)} + \cdots + x_{\pi(n)} = y_n$. Moreover,
for arbitrary $N \ge n \ge y_n \ge 0$, and with the summations below taken over the
range $y_N = y_n$ to $y_N = N - (n - y_n)$, we see that

$$p(x_1 + \cdots + x_n = y_n) = \sum p(x_1 + \cdots + x_n = y_n \mid x_1 + \cdots + x_N = y_N)\, p(x_1 + \cdots + x_N = y_N)$$

$$= \sum \binom{y_N}{y_n} \binom{N - y_N}{n - y_n} \Big/ \binom{N}{n}\; p(x_1 + \cdots + x_N = y_N), \quad 0 \le y_n \le n,$$

$$= \binom{n}{y_n} \sum \frac{(y_N)_{y_n} (N - y_N)_{n - y_n}}{(N)_n}\, p(x_1 + \cdots + x_N = y_N),$$

where $(y_N)_{y_n} = y_N (y_N - 1) \cdots [y_N - (y_n - 1)]$, etc. (Intuitively, we can imagine
sampling $n$ items without replacement from an urn of $N$ items containing $y_N$ 1's
and $N - y_N$ 0's, corresponding to the hypergeometric distribution of Section 3.2.2.)

If we now define $Q_N(\theta)$ on $\Re$ to be the step function which is 0 for $\theta < 0$ and
has jumps of $p(x_1 + \cdots + x_N = y_N)$ at $\theta = y_N / N$, $y_N = 0, \ldots, N$, we see that

$$p(x_1 + \cdots + x_n = y_n) = \binom{n}{y_n} \int_0^1 \frac{(\theta N)_{y_n} [(1 - \theta) N]_{n - y_n}}{(N)_n}\, dQ_N(\theta).$$

As $N \to \infty$,

$$\frac{(\theta N)_{y_n} [(1 - \theta) N]_{n - y_n}}{(N)_n} \to \theta^{y_n} (1 - \theta)^{n - y_n}$$

uniformly in $\theta$. Moreover, by Helly’s theorem (see, for example, Section 3.2.3 and
Ash, 1972, Section 8.2), there exists a subsequence $Q_{N_1}, Q_{N_2}, \ldots$ such that

$$\lim_{N_j \to \infty} Q_{N_j} = Q,$$

where Q is a distribution function. The result follows.
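
The uniform convergence used in the proof can be inspected numerically. The following sketch, with hypothetical function names and the arbitrary choice $n = 5$, $y_n = 2$, evaluates the largest gap between the urn (falling-factorial) ratio and its limit over the grid $\theta = k/N$, $k = 0, \ldots, N$, which is where the step function $Q_N$ places its jumps.

```python
def falling(a, r):
    """Falling factorial a (a - 1) ... (a - r + 1), a product of r factors."""
    out = 1.0
    for j in range(r):
        out *= a - j
    return out

def max_gap(N, n, y):
    """Largest absolute difference, over theta = k/N, between the urn ratio
    (theta N)_y [(1 - theta) N]_{n-y} / (N)_n and theta^y (1 - theta)^(n-y)."""
    worst = 0.0
    for k in range(N + 1):
        theta = k / N
        urn = falling(k, y) * falling(N - k, n - y) / falling(N, n)
        limit = theta ** y * (1 - theta) ** (n - y)
        worst = max(worst, abs(urn - limit))
    return worst

# The gap should shrink as the urn size N grows.
gaps = [max_gap(N, n=5, y=2) for N in (10, 100, 1000, 10000)]
```

The entries of `gaps` decrease as N increases, in line with sampling without replacement from a large urn becoming indistinguishable from independent Bernoulli sampling.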


The interpretation of this representation theorem is of profound significance 
from the point of view of subjectivist modelling philosophy. It is as if: 


(i) the $x_i$ are judged to be independent, Bernoulli random quantities (see
Section 3.2.2) conditional on a random quantity $\theta$;
(ii) $\theta$ is itself assigned a probability distribution Q;
(iii) by the strong law of large numbers, $\theta = \lim_{n \to \infty} (y_n / n)$, so that Q may be
interpreted as “beliefs about the limiting relative frequency of 1’s”.
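
The mixture form can be made concrete by choosing a particular Q. The sketch below assumes, purely for illustration, that Q is a Beta distribution with hypothetical parameters a = 2 and b = 3; for that choice the integral in Proposition 4.1 has the standard closed form B(a + y, b + n - y) / B(a, b), where y is the number of 1's and B is the beta function. The function names are likewise hypothetical.

```python
from itertools import product
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def joint_mass(xs, a=2.0, b=3.0):
    """Integral of theta^y (1 - theta)^(n - y) dQ(theta) when Q is Beta(a, b)
    (an assumed choice of Q), with y the number of ones in xs."""
    n, y = len(xs), sum(xs)
    return exp(log_beta(a + y, b + n - y) - log_beta(a, b))

def predictive_one(xs, a=2.0, b=3.0):
    """p(x_{n+1} = 1 | x_1, ..., x_n) as a ratio of joint masses."""
    xs = tuple(xs)
    return joint_mass(xs + (1,), a, b) / joint_mass(xs, a, b)

# The masses of all 0-1 sequences of length four should add up to one.
total = sum(joint_mass(xs) for xs in product((0, 1), repeat=4))

# Beliefs about the next toss after observing three 1's in four tosses.
next_one = predictive_one((1, 1, 0, 1))
```

The joint mass depends on a sequence only through its number of 1's, so it is exchangeable by construction, and `next_one` equals (a + y) / (a + b + n) = 5/9, showing how the prediction moves with the observed frequency.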


In more conventional notation and language, it is as if, conditional on $\theta$,
$x_1, \ldots, x_n$ are a random sample from a Bernoulli distribution with parameter $\theta$,
generating a parametrised joint sampling distribution

$$p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i},$$

where the parameter is assigned a prior distribution $Q(\theta)$. The operational content
of this prior distribution derives from the fact that it is as if we are assessing beliefs
about what we would anticipate observing as the limiting relative frequency from a
“very large number” of observations. Thought of as a function of $\theta$, we shall refer
to the joint sampling distribution as the likelihood function.

In terms of Definition 4.1, the assumption of exchangeability for the infinite
sequence of 0-1 random quantities $x_1, x_2, \ldots$ places a strict limitation on the
family of probability measures $P$ which can serve as predictive probability models
for the sequence. Any such $P$ must correspond to the mixture form given in
Proposition 4.1, for some choice of prior distribution $Q(\theta)$. As we range over
all possible choices of this latter distribution, we generate all possible predictive
probability models compatible with the assumption of infinite exchangeability for
the 0-1 random quantities.

Thus, “at a stroke”, we establish a justification for the conventional model 
building procedure of combining a likelihood and a prior. The likelihood is defined 
in terms of an assumption of conditional independence of the observations given 
a parameter; the latter, and its associated prior distribution, acquire an operational 
interpretation in terms of a limiting average of observables (in this case a limiting 
frequency). 
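To make the mixture form concrete, the sketch below builds a predictive model by combining a Bernoulli likelihood with a prior, and checks that the result is exchangeable and consistent across sample sizes, yet not independent. The three-point prior `PRIOR` is a purely hypothetical choice of $Q$, used only so that everything can be computed exactly with fractions.

```python
from fractions import Fraction
from itertools import product

# Hypothetical discrete prior Q: point masses on three values of theta.
PRIOR = {
    Fraction(1, 5): Fraction(1, 2),
    Fraction(1, 2): Fraction(3, 10),
    Fraction(9, 10): Fraction(1, 5),
}


def joint_prob(xs, prior=PRIOR):
    """p(x_1,...,x_n) = sum over theta of prod_i theta^{x_i} (1-theta)^{1-x_i} Q(theta)."""
    total = Fraction(0)
    for theta, weight in prior.items():
        term = weight
        for x in xs:
            term *= theta if x == 1 else 1 - theta
        total += term
    return total


# Exchangeability: the probability depends only on the number of 1's.
print(joint_prob((1, 0, 0)), joint_prob((0, 1, 0)), joint_prob((0, 0, 1)))

# The probabilities of all 0-1 sequences of a given length sum to one.
print(sum(joint_prob(xs) for xs in product((0, 1), repeat=4)))

# Marginally the observations are dependent: p(1, 1) differs from p(1)^2.
print(joint_prob((1, 1)), joint_prob((1,)) ** 2)
```

Any other choice of prior weights would give a different, but equally coherent, exchangeable predictive model; that is the sense in which ranging over $Q$ generates all such models.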

In many applications involving 0-1 random quantities, we may be more interested
in a summary random quantity, such as $y_n = x_1 + \cdots + x_n$, than in the
individual sequences of $x_i$’s. The representation of $p(x_1 + \cdots + x_n = y_n)$ is
straightforwardly obtained from Proposition 4.1.


Corollary 1. Given the conditions of Proposition 4.1,

$$
p(x_1 + \cdots + x_n = y_n) = \binom{n}{y_n} \int_0^1 \theta^{y_n}(1-\theta)^{n-y_n}\, dQ(\theta).
$$

Proof. This follows immediately from Proposition 4.1 and the fact that

$$
p(x_1 + \cdots + x_n = y_n) = \binom{n}{y_n}\, p(x_1, \ldots, x_n)
$$

for all $x_1, \ldots, x_n$ such that $x_1 + \cdots + x_n = y_n$. ∎

This provides a justification, when expressing beliefs about $y_n$, for acting as
if we have a binomial likelihood, defined by $\mathrm{Bi}(y_n \mid \theta, n)$, with a prior distribution
$Q(\theta)$ for the binomial parameter $\theta$.
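As an illustration of Corollary 1, suppose (purely as an assumption for this sketch) that $Q$ is a beta distribution with parameters $a$ and $b$. The mixture integral then has the closed form $\binom{n}{y_n} B(a+y_n,\, b+n-y_n)/B(a,b)$, which the code below compares against a direct midpoint-rule evaluation of the integral. The function names are hypothetical.

```python
from math import comb, exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(y, n, a, b):
    """Closed form of C(n, y) * integral of theta^y (1-theta)^(n-y) dQ(theta)
    when Q is assumed to be Beta(a, b)."""
    return comb(n, y) * exp(log_beta(a + y, b + n - y) - log_beta(a, b))


def mixture_by_quadrature(y, n, a, b, grid=20000):
    """Midpoint-rule approximation of the same mixture integral."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        t = (i + 0.5) * h
        prior_density = exp((a - 1) * log(t) + (b - 1) * log(1 - t) - log_beta(a, b))
        total += comb(n, y) * t ** y * (1 - t) ** (n - y) * prior_density * h
    return total


print(beta_binomial_pmf(3, 10, 2.0, 3.0), mixture_by_quadrature(3, 10, 2.0, 3.0))
print(sum(beta_binomial_pmf(y, 10, 2.0, 3.0) for y in range(11)))  # close to 1
```

With a uniform prior ($a = b = 1$) every value of $y_n$ from 0 to $n$ receives probability $1/(n+1)$, which gives a quick sanity check on the closed form.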


The formal learning process for models such as this will be developed sys- 
tematically and generally in Chapter 5. However, this simple example provides 
considerable insight into the learning process, showing how, in a sense, the key
step is a straightforward consequence of the representation theorem. 


Corollary 2. If $x_1, x_2, \ldots$ is an infinitely exchangeable sequence of 0-1 random
quantities with probability measure $P$, the conditional probability function
$p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m)$ for $x_{m+1}, \ldots, x_n$, given $x_1, \ldots, x_m$, has
the form

$$
\int_0^1 \prod_{i=m+1}^n \theta^{x_i}(1-\theta)^{1-x_i}\, dQ(\theta \mid x_1, \ldots, x_m), \qquad 1 \le m < n,
$$

where

$$
dQ(\theta \mid x_1, \ldots, x_m) = \frac{\prod_{i=1}^m \theta^{x_i}(1-\theta)^{1-x_i}\, dQ(\theta)}{\int_0^1 \prod_{i=1}^m \theta^{x_i}(1-\theta)^{1-x_i}\, dQ(\theta)}
$$

and

$$
Q(\theta) = \lim_{n\to\infty} P(y_n/n \le \theta).
$$

Proof. Clearly,

$$
p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m) = \frac{p(x_1, \ldots, x_n)}{p(x_1, \ldots, x_m)},
$$

and the result follows by applying Proposition 4.1 to both $p(x_1, \ldots, x_n)$ and
$p(x_1, \ldots, x_m)$ and rearranging the resulting expression. ∎


We thus see that the basic form of representation of beliefs does not change.
All that has happened, expressed in conventional terminology, is that the prior
distribution $Q(\theta)$ for $\theta$ has been revised, via Bayes’ theorem, into the posterior
distribution $Q(\theta \mid x_1, \ldots, x_m)$.

The conditional probability function $p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m)$ is called
the (conditional, or posterior) predictive probability function for $x_{m+1}, \ldots, x_n$
given $x_1, \ldots, x_m$, and this, of course, also provides the basis for deriving the
conditional predictive distribution of any other random quantity defined in terms of the
future observations. For example, given $x_1, \ldots, x_m$, the predictive probability
function $p(y_{n-m} \mid x_1, \ldots, x_m)$ for $y_{n-m}$, i.e., the total number of 1’s in
$x_{m+1}, \ldots, x_n$, has the form

$$
\int_0^1 \binom{n-m}{y_{n-m}} \theta^{y_{n-m}}(1-\theta)^{(n-m)-y_{n-m}}\, dQ(\theta \mid x_1, \ldots, x_m).
$$

A particularly important random quantity defined in terms of future observations
is the frequency of 1’s in a large sample. But, by Proposition 4.1 and its
Corollary 2,

$$
\lim_{n\to\infty} P\left(\frac{y_{n-m}}{n-m} \le \theta \,\Big|\, x_1, \ldots, x_m\right) = Q(\theta \mid x_1, \ldots, x_m).
$$


Thus, a posterior distribution for a parameter is seen to be a limiting case of 
a posterior (conditional) predictive distribution for an observable. 
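The ratio argument in the proof of Corollary 2 can be traced in a small exact computation. The sketch below assumes, for illustration only, that the prior $Q$ is a beta distribution with positive integer parameters $a$ and $b$, so that joint probabilities are ratios of factorials; the function names are hypothetical. It computes the one-step predictive probability as $p(x_1,\ldots,x_m,1)/p(x_1,\ldots,x_m)$ and compares it with the mean of the revised (posterior) beta distribution.

```python
from fractions import Fraction
from math import factorial


def beta_fn(a, b):
    """Beta function B(a, b) for positive integers a and b, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def joint_prob(xs, a, b):
    """p(x_1,...,x_n) under the mixture form with an assumed Beta(a, b) prior."""
    y, n = sum(xs), len(xs)
    return beta_fn(a + y, b + n - y) / beta_fn(a, b)


def predictive_next_one(xs, a, b):
    """P(x_{m+1} = 1 | x_1,...,x_m), computed as a ratio of joint probabilities."""
    return joint_prob(tuple(xs) + (1,), a, b) / joint_prob(tuple(xs), a, b)


def posterior_mean(xs, a, b):
    """Mean of the posterior Beta(a + y_m, b + m - y_m) distribution."""
    y, m = sum(xs), len(xs)
    return Fraction(a + y, a + b + m)


observed = (1, 0, 1, 1)
print(predictive_next_one(observed, 1, 1))  # 2/3 under a uniform prior
print(posterior_mean(observed, 1, 1))       # the same value
```

The agreement of the two functions reflects the corollary: the predictive probability is the Bernoulli probability averaged over the posterior distribution for $\theta$.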
4.3.2 The Multinomial Model


An alternative way of viewing the 0-1 random quantities discussed in Section 4.3.1 is
as defining category membership (given two exclusive and exhaustive categories),
in the sense that $x_i = 1$ signifies that the $i$th observation belongs to category 1 and
$x_i = 0$ signifies membership of category 2. We can extend this idea in an obvious
way by considering $k$-dimensional random vectors $\boldsymbol{x}_i$ whose $j$th component, $x_{ij}$,
takes the value 1 to indicate membership of the $j$th of $k + 1$ categories. At most one
of the $k$ components can take the value 1; if they all take the value 0 this signifies
membership of the $(k + 1)$th category. In what follows, we shall refer to such $\boldsymbol{x}_i$
as “0-1 random vectors”. If $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots$ is an infinitely exchangeable sequence of
0-1 random vectors, we can extend Proposition 4.1 in an obvious way.


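Proposition 4.2. (Representation theorem for 0-1 random vectors).
If $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots$ is an infinitely exchangeable sequence of 0-1 random vectors
with probability measure $P$, there exists a distribution function $Q$ such that
the joint mass function $p(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ for $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ has the form

$$
p(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = \int_{\Theta^*} \prod_{i=1}^n \theta_1^{x_{i1}} \theta_2^{x_{i2}} \cdots \theta_k^{x_{ik}} \Bigl(1 - \sum_{j=1}^k \theta_j\Bigr)^{1 - \sum_{j=1}^k x_{ij}}\, dQ(\boldsymbol{\theta}),
$$

where $\Theta^* = \{\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k);\ 0 \le \theta_j \le 1,\ \sum_{j=1}^k \theta_j \le 1\}$, and

$$
Q(\boldsymbol{\theta}) = \lim_{n\to\infty} P\bigl[(\bar{x}_{1n} \le \theta_1) \cap \cdots \cap (\bar{x}_{kn} \le \theta_k)\bigr],
$$

with $\bar{x}_{jn} = n^{-1}(x_{1j} + \cdots + x_{nj})$ and $\theta_j = \lim_{n\to\infty} \bar{x}_{jn}$.

Proof. This is a straightforward, albeit algebraically cumbersome, generalisation
of the proof of Proposition 4.1. ∎

As in the previous case, we are often most interested in the summary random
vector $\boldsymbol{y}_n = \boldsymbol{x}_1 + \cdots + \boldsymbol{x}_n$, whose $j$th component $y_{nj}$ is the random quantity
corresponding to the total number of occurrences of category $j$ in the $n$ observations.
We shall give the representation of $p(\boldsymbol{x}_1 + \cdots + \boldsymbol{x}_n = \boldsymbol{y}_n) = p(y_{n1}, \ldots, y_{nk})$,
generalising Corollary 1 to Proposition 4.1, and then comment on the interpretation
of these results.

Corollary. Given the conditions of Proposition 4.2, the joint mass function
$p(y_{n1}, \ldots, y_{nk})$ may be represented as

$$
\int_{\Theta^*} \binom{n}{y_{n1} \cdots y_{nk}} \prod_{j=1}^k \theta_j^{y_{nj}} \Bigl(1 - \sum_{j=1}^k \theta_j\Bigr)^{n - \sum_{j=1}^k y_{nj}}\, dQ(\boldsymbol{\theta}),
$$

where

$$
\binom{n}{y_{n1} \cdots y_{nk}} = \frac{n!}{y_{n1}!\, y_{n2}! \cdots y_{nk}!\, \bigl(n - \sum_{j=1}^k y_{nj}\bigr)!}.
$$

Proof. This follows immediately from the generalisation of the argument used
in proving Corollary 1 to Proposition 4.1. ∎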


Thus, we see in Proposition 4.2 that it is as if we have a likelihood corresponding
to the joint sampling distribution of a random sample $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$, where
each $\boldsymbol{x}_i$ has a multinomial distribution with probability function $\mathrm{Mu}_k(\boldsymbol{x}_i \mid \boldsymbol{\theta}, 1)$,
together with a prior distribution $Q$ over the multinomial parameter $\boldsymbol{\theta}$, where the
components $\theta_j$ of the latter can be thought of as the limiting relative frequency of
membership of the $j$th category. In the corollary, it is as if we assume a multinomial
likelihood, $\mathrm{Mu}_k(\boldsymbol{y}_n \mid \boldsymbol{\theta}, n)$, with a prior $Q(\boldsymbol{\theta})$ for $\boldsymbol{\theta}$.
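The corollary can be illustrated by an exact computation with $k = 2$ (three categories). The two-point prior `PRIOR` below is a hypothetical choice of $Q$, and the function names are illustrative; the sketch checks that the resulting mass function for the category counts sums to one and that the expected count for category 1 equals $n$ times the prior mean of $\theta_1$.

```python
from fractions import Fraction
from math import factorial

# Hypothetical discrete prior Q over theta = (theta_1, theta_2): (value, weight) pairs.
PRIOR = [
    ((Fraction(1, 2), Fraction(1, 4)), Fraction(2, 3)),
    ((Fraction(1, 10), Fraction(3, 5)), Fraction(1, 3)),
]


def multinomial_coef(n, counts):
    """n! / (y_1! ... y_k! (n - sum y_j)!)."""
    out = factorial(n)
    for c in list(counts) + [n - sum(counts)]:
        out //= factorial(c)
    return out


def count_prob(counts, n, prior=PRIOR):
    """p(y_n1,...,y_nk): multinomial likelihood mixed over the prior."""
    rest = n - sum(counts)
    total = Fraction(0)
    for theta, weight in prior:
        term = weight * (1 - sum(theta)) ** rest
        for t, c in zip(theta, counts):
            term *= t ** c
        total += term
    return multinomial_coef(n, counts) * total


n = 4
support = [(y1, y2) for y1 in range(n + 1) for y2 in range(n + 1 - y1)]
print(sum(count_prob(c, n) for c in support))         # 1
print(sum(c[0] * count_prob(c, n) for c in support))  # n * E(theta_1) = 22/15
```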


4.3.3 The General Model


We now consider the case of an infinitely exchangeable sequence of real-valued
random quantities $x_1, x_2, \ldots$. As one might expect, the mathematical technicalities
of establishing a representation theorem in the real-valued case are somewhat more 
complicated than in the 0-1 cases, and a rigorous treatment involves the use of 
measure-theoretic tools beyond the general mathematical level at which this volume 
is aimed. For this reason, we shall content ourselves with providing an outline proof 
of a form of the representation theorem, having no pretence at mathematical rigour 
but, hopefully, providing some intuitive insight into the result, as well as the key 
ideas underlying a form of proper proof. 


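Proposition 4.3. (General representation theorem).

If $x_1, x_2, \ldots$ is an infinitely exchangeable sequence of real-valued random
quantities with probability measure $P$, there exists a probability measure $Q$
over $\mathcal{F}$, the space of all distribution functions on $\mathbb{R}$, such that the joint
distribution function of $x_1, \ldots, x_n$ has the form

$$
P(x_1, \ldots, x_n) = \int_{\mathcal{F}} \prod_{i=1}^n F(x_i)\, dQ(F),
$$

where

$$
Q(F) = \lim_{n\to\infty} P(F_n)
$$

and $F_n$ is the empirical distribution function defined by $x_1, \ldots, x_n$.

Outline proof. (See Chow and Teicher, 1978/1988). Since

$$
F_N(x) = \frac{1}{N}\bigl[I_{(x_1 \le x)} + \cdots + I_{(x_N \le x)}\bigr],
$$

we have, by exchangeability,

$$
E\bigl[F_n(x) - F_N(x)\bigr]^2 = \frac{N-n}{nN}\Bigl\{P(x_1 \le x) - P\bigl[(x_1 \le x) \cap (x_2 \le x)\bigr]\Bigr\}.
$$

To see this, writing $I_j$ in place of $I_{(x_j \le x)}$ and noting that $I_j^2 = I_j$, we have

$$
\begin{aligned}
\bigl[F_n(x) - F_N(x)\bigr]^2 ={}& \left(\frac{1}{n^2}\sum_{i=1}^n + \frac{1}{N^2}\sum_{i=1}^N - \frac{2}{nN}\sum_{i=1}^n\right) I_i \\
&+ \left(\frac{1}{n^2}\sum_{i=1}^n \sum_{\substack{j=1 \\ j \ne i}}^n + \frac{1}{N^2}\sum_{i=1}^N \sum_{\substack{j=1 \\ j \ne i}}^N - \frac{2}{nN}\sum_{i=1}^n \sum_{\substack{j=1 \\ j \ne i}}^N\right) I_i I_j .
\end{aligned}
$$

Note also that $E(I_i) = P(x_1 \le x)$ and $E(I_i I_j) = P[(x_1 \le x) \cap (x_2 \le x)]$,
for all $i \ne j$, by exchangeability. A straightforward count of the numbers of terms
involved in the summations then gives the required result.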
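The counting argument can be verified exactly on a small case. In the sketch below the indicators $I_j$ are taken, as an illustrative assumption, to form an exchangeable 0-1 sequence generated by a hypothetical two-point mixture of Bernoulli models (`PRIOR`); the expectation of $[F_n(x) - F_N(x)]^2$ is computed by brute-force enumeration over all $2^N$ indicator patterns and compared with the closed form $\frac{N-n}{nN}\{E(I_1) - E(I_1 I_2)\}$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical two-point mixture generating exchangeable indicators I_1, I_2, ...
PRIOR = {Fraction(1, 4): Fraction(1, 2), Fraction(2, 3): Fraction(1, 2)}


def seq_prob(bits):
    """Probability of a particular pattern of indicator values."""
    total = Fraction(0)
    for theta, weight in PRIOR.items():
        term = weight
        for b in bits:
            term *= theta if b else 1 - theta
        total += term
    return total


def mean_square_gap(n, N):
    """E[F_n - F_N]^2 by enumeration of all 2^N indicator patterns."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=N):
        f_n = Fraction(sum(bits[:n]), n)  # empirical d.f. from the first n
        f_N = Fraction(sum(bits), N)      # empirical d.f. from all N
        total += seq_prob(bits) * (f_n - f_N) ** 2
    return total


def closed_form(n, N):
    """(N - n)/(nN) * {E(I_1) - E(I_1 I_2)}."""
    return Fraction(N - n, n * N) * (seq_prob((1,)) - seq_prob((1, 1)))


print(mean_square_gap(2, 6), closed_form(2, 6))  # identical fractions
```

The agreement is exact for any exchangeable choice of `seq_prob`, since only $E(I_i)$ and $E(I_i I_j)$ enter the calculation.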

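The right-hand side tends to zero as $N, n \to \infty$, and hence the random quantity
$F_n(x)$ tends in probability to some random quantity, $F(x)$, say, which implies that

$$
\prod_{j=1}^n F_N(x_j) \to \prod_{j=1}^n F(x_j) \qquad (*)
$$

in probability as $N \to \infty$, for fixed $n$.

Suppose we now let $a_1, \ldots, a_n$ denote positive integers and set

$$
A = \{\boldsymbol{a} = (a_1, \ldots, a_n);\ 1 \le a_i \le N \text{ for } 1 \le i \le n\}, \qquad
A^* = \{\boldsymbol{a} \in A;\ a_i \ne a_j \text{ for } i \ne j\},
$$

and

$$
I(\boldsymbol{a}) = I\bigl[(x_{a_1} \le x_1) \cap \cdots \cap (x_{a_n} \le x_n)\bigr].
$$

For $N \ge n$, it then follows that

$$
\prod_{j=1}^n F_N(x_j) = N^{-n} \prod_{j=1}^n \sum_{i=1}^N I_{(x_i \le x_j)} = N^{-n} \sum_{\boldsymbol{a} \in A} I(\boldsymbol{a})
= N^{-n} \Bigl[\sum_{\boldsymbol{a} \in A^*} I(\boldsymbol{a}) + \sum_{\boldsymbol{a} \in A - A^*} I(\boldsymbol{a})\Bigr].
$$

However, as $N \to \infty$,

$$
N^{-n} \sum_{\boldsymbol{a} \in A - A^*} I(\boldsymbol{a}) \le N^{-n} \sum_{\boldsymbol{a} \in A - A^*} 1 = \bigl[N^n - N(N-1)\cdots(N-n+1)\bigr]/N^n \to 0,
$$

so that

$$
\prod_{j=1}^n F_N(x_j) \approx N^{-n} \sum_{\boldsymbol{a} \in A^*} I(\boldsymbol{a}).
$$

But, by exchangeability, for any $\boldsymbol{a} \in A^*$,

$$
\int I(\boldsymbol{a})\, dP = \int I\bigl[(x_1 \le x_1) \cap \cdots \cap (x_n \le x_n)\bigr]\, dP
= P\bigl[(x_1 \le x_1) \cap \cdots \cap (x_n \le x_n)\bigr] = P(x_1, \ldots, x_n),
$$

and so

$$
\int \prod_{j=1}^n F_N(x_j)\, dP \approx \bigl[N(N-1)\cdots(N-n+1)/N^n\bigr]\, P\bigl[(x_1 \le x_1) \cap \cdots \cap (x_n \le x_n)\bigr].
$$

Recalling (*), we see that, as $N \to \infty$,

$$
\int \prod_{j=1}^n F(x_j)\, dQ(F) = P(x_1, \ldots, x_n),
$$

where $Q(F) = \lim_{N\to\infty} P(F_N)$. ∎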


The general form of representation for real-valued exchangeable random quantities
is therefore as if we have independent observations $x_1, \ldots, x_n$ conditional on
$F$, an unknown (i.e., random) distribution function (which plays the role of an
infinite-dimensional “parameter” in this case), with a belief distribution $Q$ for $F$,
having the operational interpretation of “what we believe the empirical distribution
function would look like for a large sample”.

The structure of the learning process for a general exchangeable sequence of 
real-valued random quantities, with the distribution function representation given 
in Proposition 4.3, cannot easily be described explicitly. In what follows, we shall 
therefore find it convenient to restrict attention to those cases where a corresponding 
representation holds in terms of density functions, labelled by a finite-dimensional 
parameter, $\theta$, say, rather than the infinite-dimensional label, $F$. For ease of
reference, we present this representation as a corollary to Proposition 4.3.

Corollary 1. Assuming the required densities to exist, under the conditions
of Proposition 4.3 the joint density of $x_1, \ldots, x_n$ has the form

$$
p(x_1, \ldots, x_n) = \int_{\Theta} \prod_{i=1}^n p(x_i \mid \theta)\, dQ(\theta),
$$

with $p(\cdot \mid \theta)$ denoting the density function corresponding to the “unknown
parameter” $\theta \in \Theta$.

The role of Bayes’ theorem in the learning process is now easily identified.

Corollary 2. If $x_1, x_2, \ldots$ is an infinitely exchangeable sequence of real-valued
random quantities admitting a density representation as in Corollary 1,
then

$$
p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m) = \int_{\Theta} \prod_{i=m+1}^n p(x_i \mid \theta)\, dQ(\theta \mid x_1, \ldots, x_m),
$$

where

$$
dQ(\theta \mid x_1, \ldots, x_m) = \frac{\prod_{i=1}^m p(x_i \mid \theta)\, dQ(\theta)}{\int_{\Theta} \prod_{i=1}^m p(x_i \mid \theta)\, dQ(\theta)}.
$$

Proof. This follows immediately on writing

$$
p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m) = \frac{p(x_1, \ldots, x_n)}{p(x_1, \ldots, x_m)},
$$

applying the density representation form to both $p(x_1, \ldots, x_n)$ and $p(x_1, \ldots, x_m)$,
and rearranging the resulting expression. ∎


The technical discussion in this section has centred on exchangeable sequences,
$x_1, x_2, \ldots$, of real-valued random quantities. In fact, everything carries
over in an obviously analogous manner to the case of exchangeable sequences
$\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots$, with $\boldsymbol{x}_i \in \mathbb{R}^k$. All that happens, in effect, is that the
distribution functions and densities referred to in Proposition 4.3 and its corollaries become the
joint distribution functions and densities for the $k$ components of the $\boldsymbol{x}_i$. To avoid
tedious distinctions between $x \in \mathbb{R}$ and $\boldsymbol{x} \in \mathbb{R}^k$, in subsequent developments we
shall often just write $x \in X$. In cases where the distinction between $k = 1$ and
$k > 1$ matters, it will be clear from the context what is intended.

In Section 4.8.1, we shall give detailed references to the literature on representation
theorems for exchangeable sequences, including far-reaching generalisations
of the 0-1 and real-valued cases. However, even the simple cases we have presented
already provide, from the subjectivist perspective, a deeply satisfying clarification
of such fundamental notions as models, parameters, conditional independence and
the relationship between beliefs and limiting frequencies.

In terms of Definition 4.1, the assumption of exchangeability for the real-valued
random quantities $x_1, x_2, \ldots$ again places (as in the 0-1 case) a limitation
on the family of probability measures $P$ which can serve as predictive probability
models. In this case, however, in the context of the general form of representation
given in Proposition 4.3, the “parameter”, $F$, underlying the conditional independence
structure within the mixture is a random distribution function, so that the
“parameter” is, in effect, infinite dimensional, and the family of coherent predictive
probability models is generated by ranging through all possible prior distributions
$Q(F)$. The mathematical form of the required representation is well-defined, but
the practical task of translating actual beliefs about real-valued random quantities
into the required mathematical form of a measure over a function space seems,
to say the least, a somewhat daunting prospect. It is interesting therefore to see
whether there exist more complex formal structures of belief, imposing further
symmetries or structure beyond simple exchangeability, which lead to more specific
and “familiar” model representations. In particular, it is of interest to identify
situations in which exchangeability leads to a mixture of conditional independence
structures which are defined in terms of a finite dimensional parameter so that the
more explicit forms given in the corollaries to Proposition 4.3 can be invoked.
Given the interpretation of the components of such a parameter as strong law limits
of simple sequences of functions of the observations, the specification of $Q$, and
hence of the complete predictive probability model $P$, then becomes a much less
daunting task.


4.4 MODELS VIA INVARIANCE 
4.4.1 The Normal Model


Suppose that in addition to judging an infinite sequence of real-valued random
quantities $x_1, x_2, \ldots$ to be exchangeable, we consider the possibility of further
judgements of invariance, perhaps relating to the “geometry” of the space in which
a finite subset of observations, $x_1, \ldots, x_n$ say, lie. The following definitions
describe two such possible judgements of invariance. As with exchangeability,
there is no claim that such judgements have any a priori special status. They are
intended, simply, as possible forms of judgement that might be made, and whose
consequences might be interesting to explore.


Definition 4.4. (Spherical symmetry). A sequence of random quantities
$x_1, \ldots, x_n$ is said to have spherical symmetry under a predictive probability
model $P$ if the latter defines the distributions of $\boldsymbol{x} = (x_1, \ldots, x_n)$ and $A\boldsymbol{x}$ to
be identical, for any (orthogonal) $n \times n$ matrix $A$ such that $A'A = I$.

This definition encapsulates a judgement of rotational symmetry, in the sense
that, although measurements happened to have been expressed in terms of a particular
coordinate system (yielding $x_1, \ldots, x_n$), our quantitative beliefs would not
change if they had been expressed in a rotated coordinate system. Since rotational
invariance fixes “distances” from the origin, this is equivalent to a judgement
of identical beliefs for all outcomes of $x_1, \ldots, x_n$ leading to the same value of
$x_1^2 + \cdots + x_n^2$.

The next result states that if we make the judgement of spherical symmetry
(which in turn implies a judgement of exchangeability, since permutation is a special
case of orthogonal transformation), the general mixture representation given in
Proposition 4.3 assumes a much more concrete and familiar form.


Proposition 4.4. (Representation theorem under spherical symmetry).

If $x_1, x_2, \ldots$ is an infinite sequence of real-valued random quantities with
probability measure $P$, and if, for any $n$, $\{x_1, \ldots, x_n\}$ have spherical symmetry,
there exists a distribution function $Q$ on $\mathbb{R}^+$ such that the joint distribution
function of $x_1, \ldots, x_n$ has the form

$$
P(x_1, \ldots, x_n) = \int_0^\infty \prod_{i=1}^n \Phi(\lambda^{1/2} x_i)\, dQ(\lambda),
$$

where $\Phi$ is the standard normal distribution function and

$$
Q(\lambda) = \lim_{n\to\infty} P(s_n^{-2} \le \lambda),
$$

with $s_n^2 = n^{-1}(x_1^2 + \cdots + x_n^2)$, and $\lambda^{-1} = \lim_{n\to\infty} s_n^2$.

Proof. See, for example, Freedman (1963a) and Kingman (1972); details are
omitted here, since the proof of a generalisation of this result will be given in full
in Proposition 4.5. ∎


The form of representation obtained in Proposition 4.4 tells us that the judge- 
ment of spherical symmetry restricts the set of coherent predictive probability 
models to those which are generated by acting as if: 


(i) observations are conditionally independent normal random quantities, given
the random quantity $\lambda$ (which, as a “labelling parameter”, corresponds to the
precision; i.e., the reciprocal of the variance);

(ii) $\lambda$ is itself assigned a distribution $Q$;

(iii) by the strong law of large numbers, $\lambda^{-1} = \lim_{n\to\infty} s_n^2$, so that $Q$ may be
interpreted as “beliefs about the reciprocal of the limiting mean sum of squares
of the observations”.
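The converse direction of this representation is easy to check numerically: any mixture over the precision $\lambda$ of independent zero-mean normal densities depends on the observations only through $x_1^2 + \cdots + x_n^2$, and is therefore unchanged by rotations. The sketch below uses a hypothetical two-point prior on $\lambda$ (`PRIOR`) and plane rotations; all names and numerical values are illustrative assumptions.

```python
import math

# Hypothetical discrete prior Q on the precision: (lambda, weight) pairs.
PRIOR = [(0.5, 0.3), (2.0, 0.7)]


def joint_density(xs, prior=PRIOR):
    """Scale mixture of independent N(0, 1/lambda) densities."""
    n = len(xs)
    ss = sum(x * x for x in xs)
    return sum(
        w * (lam / (2 * math.pi)) ** (n / 2) * math.exp(-0.5 * lam * ss)
        for lam, w in prior
    )


def rotate(xs, i, j, angle):
    """Apply a plane rotation (an orthogonal transformation) to coordinates i and j."""
    ys = list(xs)
    c, s = math.cos(angle), math.sin(angle)
    ys[i] = c * xs[i] - s * xs[j]
    ys[j] = s * xs[i] + c * xs[j]
    return ys


x = [0.3, -1.2, 0.8]
y = rotate(rotate(x, 0, 1, 0.7), 1, 2, -1.1)  # composition of two rotations
print(joint_density(x), joint_density(y))      # equal up to rounding error
```

This only demonstrates that such mixtures are spherically symmetric; the substance of Proposition 4.4 is the much stronger statement that, for infinite sequences, no other spherically symmetric models exist.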
For related work see Dawid (1977, 1978). To obtain a justification for the usual
normal specification, with “unknown mean and precision”, we need to generalise 
the above discussion slightly. 

We note first that the judgement of spherical symmetry implicitly attaches a 
special significance to the origin of the coordinate system, since it is equivalent 
to a judgement of invariance in terms of distance from the origin. In general, 
however, if we were to feel able to make a judgement of spherical symmetry, it 
would typically only be relative to an “origin” defined in terms of the “centre” of 
the random quantities under consideration. This motivates the following definition. 


Definition 4.5. (Centred spherical symmetry). A sequence of random quantities
$x_1, \ldots, x_n$ is said to have centred spherical symmetry if the random quantities
$x_1 - \bar{x}_n, \ldots, x_n - \bar{x}_n$ have spherical symmetry, where $\bar{x}_n = n^{-1}\sum_{i=1}^n x_i$. This
is equivalent to a judgement of identical beliefs for all outcomes of $x_1, \ldots, x_n$
leading to the same value of $(x_1 - \bar{x}_n)^2 + \cdots + (x_n - \bar{x}_n)^2$.


Proposition 4.5. (Representation under centred spherical symmetry).

If $x_1, x_2, \ldots$ is an infinitely exchangeable sequence of real-valued random
quantities with probability measure $P$, and if, for any $n$, $\{x_1, \ldots, x_n\}$ have
centred spherical symmetry, then there exists a distribution function $Q$ on
$\mathbb{R} \times \mathbb{R}^+$ such that the joint distribution of $x_1, \ldots, x_n$ has the form

$$
P(x_1, \ldots, x_n) = \int_{\mathbb{R} \times \mathbb{R}^+} \prod_{i=1}^n \Phi\bigl(\lambda^{1/2}(x_i - \mu)\bigr)\, dQ(\mu, \lambda),
$$

where $\Phi$ is the standard normal distribution function and

$$
Q(\mu, \lambda) = \lim_{n\to\infty} P\bigl[(\bar{x}_n \le \mu) \cap (s_n^{-2} \le \lambda)\bigr],
$$

with $\bar{x}_n = n^{-1}(x_1 + \cdots + x_n)$, $s_n^2 = n^{-1}[(x_1 - \bar{x}_n)^2 + \cdots + (x_n - \bar{x}_n)^2]$,
$\mu = \lim_{n\to\infty} \bar{x}_n$, and $\lambda^{-1} = \lim_{n\to\infty} s_n^2$.


Proof. (Smith, 1981). Since the sequence $x_1, x_2, \ldots$ is exchangeable, by
Proposition 4.3 there exists a random distribution function $F$ such that, conditional
on $F$, the random quantities $x_1, \ldots, x_n$, for any $n$, are independent. There
is therefore a random characteristic function, $\phi$, corresponding to $F$, such that

$$
E\Bigl[\exp\Bigl(i \sum_{j=1}^n s_j x_j\Bigr) \Bigm| F\Bigr] = \prod_{j=1}^n \phi(s_j),
$$

and hence

$$
E\Bigl[\exp\Bigl(i \sum_{j=1}^n s_j x_j\Bigr)\Bigr] = E\Bigl[\prod_{j=1}^n \phi(s_j)\Bigr].
$$

If we now define $y_j = x_j - \bar{x}_n$, $j = 1, \ldots, n$, it follows that

$$
E\Bigl[\exp\Bigl(i \sum_{j=1}^n s_j y_j\Bigr)\Bigr] = E\Bigl[\prod_{j=1}^n \phi(s_j)\Bigr] \qquad (*)
$$

for all real $s_1, \ldots, s_n$ such that $s_1 + \cdots + s_n = 0$. Since $y_1, \ldots, y_n$ are spherically
symmetric, both sides of this latter equality depend only on $s_1^2 + \cdots + s_n^2$.

Recalling that $\phi(-t) = \overline{\phi(t)}$, the complex conjugate, and that $\phi(0) = 1$, it
follows that, for any real $u$ and $v$,

$$
\begin{aligned}
E\bigl\{\bigl|\phi(u+v)\phi(u-v) - \phi^2(u)\phi(v)\phi(-v)\bigr|^2\bigr\}
={}& E\{\phi(u+v)\phi(u-v)\phi(-u-v)\phi(v-u)\} \\
&- E\{\phi(u+v)\phi(u-v)\phi^2(-u)\phi(-v)\phi(v)\} \\
&- E\{\phi(-u-v)\phi(v-u)\phi^2(u)\phi(v)\phi(-v)\} \\
&+ E\{\phi^2(u)\phi^2(v)\phi^2(-v)\phi^2(-u)\},
\end{aligned}
$$

where all four terms in this expression are of the form of the right-hand side of (*)
with $n = 8$, $s_1 + \cdots + s_8 = 0$ and $s_1^2 + \cdots + s_8^2 = 4(u^2 + v^2)$. All the four terms
are therefore equal, so that the overall expression is zero. This implies that, almost
surely with respect to the probability measure $P$, $\phi$ satisfies the functional equation

$$
\phi(u+v)\phi(u-v) = \phi^2(u)\phi(v)\phi(-v)
$$

for all real $u$ and $v$. This can be rewritten in the form

$$
\psi_1(u+v) + \psi_2(u-v) = A(u) + B(v),
$$

where $\psi_1(t) = \psi_2(t) = \log\phi(t)$, and where $A(u) = 2\log\phi(u)$ and $B(v) =
\log[\phi(v)\phi(-v)]$; it follows that $\log\phi(t)$ is a quadratic in $t$ (see, for example, Kagan,
Linnik and Rao, 1973, Lemma 1.5.1). Again using $\phi(-t) = \overline{\phi(t)}$, $\phi(0) = 1$,
we see that, for this quadratic, the constant coefficient must be zero, the linear
coefficient purely imaginary and the quadratic coefficient real and non-positive.
This establishes that the random characteristic function $\phi$ takes the form

$$
\phi(t) = \exp\Bigl\{i\mu t - \frac{t^2}{2\lambda}\Bigr\}
$$

for some random quantities $\mu \in \mathbb{R}$, $\lambda \in \mathbb{R}^+$.

If we now define a random quantity $z$ by

$$
z = \exp\Bigl(i \sum_{j=1}^n s_j x_j\Bigr),
$$

then, by iterated expectation, we have

$$
E(z \mid \mu, \lambda) = E\bigl[E(z \mid F) \bigm| \mu, \lambda\bigr] = E\Bigl[\prod_{j=1}^n \phi(s_j) \Bigm| \mu, \lambda\Bigr],
$$

so that

$$
E\Bigl[\exp\Bigl(i \sum_{j=1}^n s_j x_j\Bigr) \Bigm| \mu, \lambda\Bigr] = \prod_{j=1}^n \exp\Bigl(i\mu s_j - \frac{s_j^2}{2\lambda}\Bigr).
$$

This establishes that, conditional on $\mu$ and $\lambda$, $x_1, \ldots, x_n$ are independent normally
distributed random quantities, each with mean $\mu$ and precision $\lambda$. The mixing
distribution in the general representation theorem reduces therefore to a joint
distribution over $\mu$ and $\lambda$. But, by the strong law of large numbers,

$$
\lim_{n\to\infty} \bar{x}_n = \mu, \qquad
\lim_{n\to\infty} \frac{(x_1 - \bar{x}_n)^2 + \cdots + (x_n - \bar{x}_n)^2}{n} = \frac{1}{\lambda},
$$

and the result follows. ∎


We see, therefore, that the combined judgements of exchangeability and cen- 
tred spherical symmetry restrict the set of coherent predictive probability models 
to those which, expressed in conventional terminology, correspond to acting as if: 


(i) we have a random sample from a normal distribution with unknown mean and
precision parameters, $\mu$ and $\lambda$, generating a likelihood

$$
p(x_1, \ldots, x_n \mid \mu, \lambda) = \prod_{i=1}^n N(x_i \mid \mu, \lambda);
$$

(ii) we have a joint prior distribution $Q(\mu, \lambda)$ for the unknown parameters, $\mu$
and $\lambda$, which can be given an operational interpretation as “beliefs about the
sample mean and reciprocal sample variance which would result from a large
number of observations”.
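The operational interpretation in (ii) can be illustrated by simulation: for fixed values of $\mu$ and $\lambda$, the sample mean and the reciprocal sample variance of a large normal sample settle close to those values. The sketch below is a seeded simulation under assumed, purely illustrative parameter values; `simulate_limits` is a hypothetical helper name.

```python
import random


def simulate_limits(mu, lam, n, seed=0):
    """Draw n observations from N(mu, 1/lam) and return the sample mean and
    the reciprocal of the sample variance s_n^2 (with divisor n)."""
    rng = random.Random(seed)
    sigma = lam ** -0.5  # standard deviation corresponding to precision lam
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / n
    return mean, 1.0 / s2


# Illustrative values (assumptions): mu = 1.5, lambda = 4.
mean, precision = simulate_limits(mu=1.5, lam=4.0, n=100_000)
print(mean, precision)  # close to 1.5 and 4.0 respectively
```

In the representation itself, $\mu$ and $\lambda$ are not fixed but are assigned the prior $Q(\mu, \lambda)$; the simulation shows only what happens conditional on one particular pair of values.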


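4.4.2 The Multivariate Normal Model


Suppose now that we have an infinitely exchangeable sequence of random vectors
$\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots$ taking values in $\mathbb{R}^k$, $k \ge 2$, and that, in addition, we judge, for all
$n$ and for all $\boldsymbol{c} \in \mathbb{R}^k$, that the random quantities $\boldsymbol{c}'\boldsymbol{x}_1, \ldots, \boldsymbol{c}'\boldsymbol{x}_n$ have centred
spherical symmetry. The next result then provides a multivariate generalisation of
Proposition 4.5.

Proposition 4.6. (Multivariate representation theorem under centred spherical
symmetry). If $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots$ is an infinitely exchangeable sequence of random
vectors taking values in $\mathbb{R}^k$, with probability measure $P$, such that, for
any $n$ and $\boldsymbol{c} \in \mathbb{R}^k$, the random quantities $\boldsymbol{c}'\boldsymbol{x}_1, \ldots, \boldsymbol{c}'\boldsymbol{x}_n$ have centred spherical
symmetry, the structure of evaluations under $P$ of probabilities of events
defined by $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ is as if the latter were independent, multivariate normally
distributed random vectors, conditional on a random mean vector $\boldsymbol{\mu}$
and a random precision matrix $\boldsymbol{\lambda}$, with a distribution over $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$ induced
by $P$, where

$$
\boldsymbol{\mu} = \lim_{n\to\infty} \frac{1}{n}\sum_{j=1}^n \boldsymbol{x}_j, \qquad
\boldsymbol{\lambda}^{-1} = \lim_{n\to\infty} \frac{1}{n}\sum_{j=1}^n (\boldsymbol{x}_j - \bar{\boldsymbol{x}}_n)(\boldsymbol{x}_j - \bar{\boldsymbol{x}}_n)'.
$$

Proof. Defining $y_j = \boldsymbol{c}'\boldsymbol{x}_j$, $j = 1, \ldots, n$, we see that the random quantities
$y_1, \ldots, y_n$ have centred spherical symmetry and so, by Proposition 4.5, there exist
$\mu = \mu(\boldsymbol{c})$ and $\lambda = \lambda(\boldsymbol{c})$ such that, for all $t_j \in \mathbb{R}$, $j = 1, \ldots, n$,

$$
E\Bigl[\exp\Bigl(i \sum_{j=1}^n t_j y_j\Bigr) \Bigm| \mu, \lambda\Bigr] = \prod_{j=1}^n \exp\Bigl(i\mu t_j - \frac{t_j^2}{2\lambda}\Bigr),
$$

where

$$
\mu = \mu(\boldsymbol{c}) = \lim_{n\to\infty} \frac{1}{n}\sum_{j=1}^n y_j, \qquad
\lambda^{-1} = \lim_{n\to\infty} \frac{1}{n}\sum_{j=1}^n (y_j - \bar{y}_n)^2.
$$

But

$$
\mu = \mu(\boldsymbol{c}) = \boldsymbol{c}' \lim_{n\to\infty}\Bigl(\frac{1}{n}\sum_{j=1}^n \boldsymbol{x}_j\Bigr) = \boldsymbol{c}'\boldsymbol{\mu}
$$

and

$$
\lambda^{-1} = \lambda(\boldsymbol{c})^{-1} = \boldsymbol{c}' \Bigl[\lim_{n\to\infty} \frac{1}{n}\sum_{j=1}^n (\boldsymbol{x}_j - \bar{\boldsymbol{x}}_n)(\boldsymbol{x}_j - \bar{\boldsymbol{x}}_n)'\Bigr] \boldsymbol{c} = \boldsymbol{c}'\boldsymbol{\lambda}^{-1}\boldsymbol{c},
$$

so that

$$
E\Bigl[\exp\Bigl(i \sum_{j=1}^n t_j \boldsymbol{c}'\boldsymbol{x}_j\Bigr) \Bigm| \boldsymbol{\mu}, \boldsymbol{\lambda}\Bigr]
= \prod_{j=1}^n \exp\Bigl(i\boldsymbol{c}'\boldsymbol{\mu}\, t_j - \tfrac{1}{2}(\boldsymbol{c}'\boldsymbol{\lambda}^{-1}\boldsymbol{c})\, t_j^2\Bigr)
$$

for all $\boldsymbol{c} \in \mathbb{R}^k$, $t_j \in \mathbb{R}$, $j = 1, \ldots, n$. It follows that, for all $\boldsymbol{t}_j \in \mathbb{R}^k$, $j = 1, \ldots, n$,

$$
E\Bigl[\exp\Bigl(i \sum_{j=1}^n \boldsymbol{t}_j'\boldsymbol{x}_j\Bigr) \Bigm| \boldsymbol{\mu}, \boldsymbol{\lambda}\Bigr]
= \prod_{j=1}^n \exp\Bigl(i\boldsymbol{\mu}'\boldsymbol{t}_j - \tfrac{1}{2}\boldsymbol{t}_j'\boldsymbol{\lambda}^{-1}\boldsymbol{t}_j\Bigr),
$$

so that, conditional on $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$, $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ are independent multivariate normal
random quantities each with mean $\boldsymbol{\mu}$ and precision matrix $\boldsymbol{\lambda}$. ∎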
4.4.3 The Exponential Model


Suppose $x_1, x_2, \ldots$ is judged to be an infinitely exchangeable sequence of positive
real-valued random quantities. In particular, we note that this implies, for any
pair $x_i, x_j$, an identity of beliefs for any events in the positive quadrant which are
symmetrically placed with respect to the 45° line through the origin.


Figure 4.1 $A_1$, $A_2$, $B_1$, $B_2$ reflections in 45° line. $C_1$, $C_2$ reflections in (dashed) 45° line


Thus, for example, in Figure 4.1, the probabilities assigned to $A_1$ and $A_2$, $B_1$
and $B_2$, respectively, must be equal, for any $i \ne j$. In general, however, the
assumption of exchangeability would not imply that events such as $C_1$ and $C_2$ have
equal probabilities, even though they are symmetrically placed with respect to a
45° line (but not the one through the origin).

It is interesting to ask under what circumstances an individual might judge
events such as $C_1$, $C_2$ to have equal probabilities. The answer is suggested by the
additional (dashed) lines in the figure. If we added to the assumption of exchangeability
the judgement that the “origins” of the $x_i$ and $x_j$ axes are “irrelevant”, so
far as probability judgements are concerned, then the probabilities of events such
as $C_1$ and $C_2$ would be judged equal. In perhaps more familiar terms, this would
be as though, when making judgements about events in the positive quadrant, an
individual's judgement exhibited a form of “lack of memory” property with respect
to the origin. If such a judgement is assumed to hold for all subsets of $n$ (rather
than just two) random quantities, the resulting representation is as follows.
Proposition 4.7. (Continuous representation under origin invariance).

If $x_1, x_2, \ldots$ is an infinitely exchangeable sequence of positive real-valued
random quantities with probability measure $P$, such that, for all $n$, and any
event $A$ in $\mathbb{R}^+ \times \cdots \times \mathbb{R}^+$,

$$
P[(x_1, \ldots, x_n) \in A] = P[(x_1, \ldots, x_n) \in A + \boldsymbol{a}]
$$

for all $\boldsymbol{a} \in \mathbb{R} \times \cdots \times \mathbb{R}$ such that $\boldsymbol{a}'\boldsymbol{1} = 0$ and $A + \boldsymbol{a}$ is an event in $\mathbb{R}^+ \times \cdots \times \mathbb{R}^+$,
then the joint density for $x_1, \ldots, x_n$ has the form

$$
p(x_1, \ldots, x_n) = \int_0^\infty \prod_{i=1}^n \theta \exp(-\theta x_i)\, dQ(\theta),
$$

where $\theta = \lim_{n\to\infty} \bar{x}_n^{-1}$, and

$$
Q(\theta) = \lim_{n\to\infty} P\bigl[(\bar{x}_n^{-1}) \le \theta\bigr], \qquad \bar{x}_n = n^{-1}(x_1 + \cdots + x_n).
$$

Outline proof. (Diaconis and Ylvisaker, 1985). By the general representation
theorem, there exists a random distribution function $F$, such that, conditional on
$F$, $x_1, \ldots, x_n$ are independent, for any $n$. It can be shown that the additional
invariance property continues to hold conditional on $F$, so that, for any $i \ne j$,

$$
P[(x_i, x_j) \in A \mid F] = P[(x_i, x_j) \in A + \boldsymbol{a} \mid F]
$$

for $A$ and $\boldsymbol{a}$ as described above. If we now take $\boldsymbol{a}' = (-a_2, a_2)$ and

$$
A = \{(x_i, x_j);\ x_i > a_1 + a_2,\ x_j > 0\},
$$

we have

$$
\begin{aligned}
P[(x_i > a_1 + a_2) \cap (x_j > 0) \mid F] &= P[(x_i > a_1) \cap (x_j > a_2) \mid F] \\
&= P[(x_i > a_1) \mid F]\, P[(x_j > a_2) \mid F].
\end{aligned}
$$

By exchangeability, and recalling that $x_j$ is certainly positive for all $j$, this implies
that

$$
P(x_i > a_1 + a_2 \mid F) = P(x_i > a_1 \mid F)\, P(x_i > a_2 \mid F).
$$

But this functional relationship implies, for positive real-valued $x_i$, that

$$
P(x_i > x \mid F) = e^{-\theta x}
$$

for some $\theta$, so that the density, $p(x_i \mid F) = p(x_i \mid \theta)$, is the derivative of
$1 - \exp(-\theta x_i)$, and hence given by $\theta \exp(-\theta x_i)$. The rest of the result follows
on noting that, by the strong law of large numbers, $\theta^{-1} = \lim_{n\to\infty} [n^{-1}(x_1 + \cdots + x_n)]$. ∎
Thus, we see that judgements of exchangeability and “lack of memory” for
sequences of positive real-valued random quantities constrain the possible predictive
probability models for the sequence to be those which are generated by acting
as if we have a random sample from an exponential distribution with unknown
parameter $\theta$, with a prior distribution $Q$ for the latter. In fact, if $Q^*$ denotes the
corresponding distribution for $\phi = \theta^{-1} = \lim_{n\to\infty} \bar{x}_n$, it may be easier to use the
“reparametrised” representation

$$
p(x_1, \ldots, x_n) = \int_0^\infty \prod_{i=1}^n \phi^{-1} \exp(-\phi^{-1} x_i)\, dQ^*(\phi),
$$

since $Q^*$ is then more directly accessible as “beliefs about the sample mean from
a large number of observations”.

Recalling the possible motivation given above for the additional invariance
assumption on the sequence $x_1, x_2, \ldots$, it is interesting to note the very specific
and well-known “lack of memory” property of the exponential distribution; namely,

$$
P(x_i > a_1 + a_2 \mid \theta, x_i > a_1) = P(x_i > a_2 \mid \theta),
$$

which appears implicitly in the above proof.
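Both properties discussed here can be checked numerically. The sketch below assumes, for illustration only, that the prior $Q$ is a gamma distribution with shape `alpha` and rate `beta`, in which case the mixture integral has the closed form $\frac{\Gamma(\alpha+n)}{\Gamma(\alpha)} \frac{\beta^\alpha}{(\beta + \sum x_i)^{\alpha+n}}$. Because this depends on the data only through $\sum x_i$, it is unchanged by any shift $\boldsymbol{a}$ with $\boldsymbol{a}'\boldsymbol{1} = 0$ that keeps all coordinates positive. The names and parameter values are hypothetical.

```python
import math


def joint_density(xs, alpha=2.0, beta=1.5):
    """Mixture of independent exponential densities theta*exp(-theta*x_i)
    over an assumed Gamma(alpha, beta) prior for theta (closed form)."""
    n, s = len(xs), sum(xs)
    log_p = (
        math.lgamma(alpha + n) - math.lgamma(alpha)
        + alpha * math.log(beta) - (alpha + n) * math.log(beta + s)
    )
    return math.exp(log_p)


def survival(x, theta):
    """P(x_i > x | theta) for the exponential distribution."""
    return math.exp(-theta * x)


# Origin invariance: shifting by a vector whose components sum to zero.
x = [1.0, 2.5, 0.7]
a = [0.4, -1.1, 0.7]
shifted = [xi + ai for xi, ai in zip(x, a)]
print(joint_density(x), joint_density(shifted))  # equal up to rounding error

# Lack of memory: P(x > a1 + a2 | x > a1) = P(x > a2).
theta, a1, a2 = 0.8, 1.3, 2.1
print(survival(a1 + a2, theta) / survival(a1, theta), survival(a2, theta))
```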


4.4.4 The Geometric Model 


Suppose $x_1, x_2, \ldots$ is judged to be an infinitely exchangeable sequence of strictly
positive integer-valued random quantities. It is easy to see that we could repeat
the entire introductory discussion of Section 4.4.3, except that events would now
be defined in terms of sets of points on the lattice $\mathbb{Z}^+ \times \cdots \times \mathbb{Z}^+$, rather than as
regions in $\mathbb{R}^+ \times \cdots \times \mathbb{R}^+$. This enables us to state the following representation
result.


Proposition 4.8. (Discrete representation under origin invariance).

If $x_1, x_2, \ldots$ is an infinitely exchangeable sequence of positive integer-valued
random quantities with probability measure $P$, such that, for all $n$ and any
event $A$ in $\mathbb{Z}^+ \times \cdots \times \mathbb{Z}^+$,

$$
P[(x_1, \ldots, x_n) \in A] = P[(x_1, \ldots, x_n) \in A + \boldsymbol{a}]
$$

for all $\boldsymbol{a} \in \mathbb{Z} \times \cdots \times \mathbb{Z}$ such that $\boldsymbol{a}'\boldsymbol{1} = 0$ and $A + \boldsymbol{a}$ is an event in
$\mathbb{Z}^+ \times \cdots \times \mathbb{Z}^+$, then the joint density for $x_1, \ldots, x_n$ has the form

$$
p(x_1, \ldots, x_n) = \int_0^1 \prod_{i=1}^n \theta(1-\theta)^{x_i - 1}\, dQ(\theta),
$$

where $\theta = \lim_{n\to\infty} \bar{x}_n^{-1}$, $\bar{x}_n = n^{-1}(x_1 + \cdots + x_n)$, and

$$
Q(\theta) = \lim_{n\to\infty} P\bigl[(\bar{x}_n^{-1}) \le \theta\bigr].
$$
Outline proof. This follows precisely the steps in the proof of Proposition 4.7,
except that, for positive integer-valued $x_i$, the functional equation

$$
P(x_i > a_1 + a_2 \mid F) = P(x_i > a_1 \mid F)\, P(x_i > a_2 \mid F)
$$

implies that

$$
P(x_i > x \mid F) = (1 - \theta)^x
$$

for some $\theta$, so that the probability function, $p(x_i \mid F) = p(x_i \mid \theta)$, is easily seen
to be $\theta(1-\theta)^{x_i - 1}$. Again, by the strong law of large numbers, $\theta^{-1} = \lim_{n\to\infty} \bar{x}_n$,
where, since $x_i \ge 1$ for all $i$, $0 \le \theta \le 1$. ∎


In this case, the coherent predictive probability models must be those which
are generated by acting as if we have a random sample from a geometric distribution
with unknown parameter $\theta$, with a prior distribution $Q$ for the latter, where
$\theta^{-1} = \lim_{n\to\infty} \bar{x}_n$.

Again, recalling the possible motivation for the additional invariance property,
it is interesting to note the familiar “lack of memory” property of the geometric
distribution:

$$
P(x_i > a_1 + a_2 \mid \theta, x_i > a_1) = P(x_i > a_2 \mid \theta).
$$
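The lack of memory property can be verified exactly for the geometric probability function $\theta(1-\theta)^{x-1}$, $x = 1, 2, \ldots$. The sketch below uses exact fractions; the value $\theta = 1/3$ and the function names are illustrative assumptions.

```python
from fractions import Fraction


def geometric_pmf(x, theta):
    """p(x | theta) = theta (1 - theta)^(x - 1), for x = 1, 2, ..."""
    return theta * (1 - theta) ** (x - 1)


def survival(x, theta):
    """P(x_i > x | theta), computed as one minus the summed probability function."""
    return 1 - sum(geometric_pmf(r, theta) for r in range(1, x + 1))


theta = Fraction(1, 3)
a1, a2 = 2, 3

# P(x > a1 + a2 | x > a1) compared with P(x > a2).
print(survival(a1 + a2, theta) / survival(a1, theta), survival(a2, theta))

# The survival function has the form (1 - theta)^x.
print(survival(4, theta), (1 - theta) ** 4)
```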


4.5 MODELS VIA SUFFICIENT STATISTICS 
4.5.1 Summary Statistics 


We begin with a formal definition, which enables us to discuss the process of summarising
a sequence, or sample, of random quantities, $x_1, \ldots, x_m$. (In general, our
discussion carries over to the case of random vectors, but for notational simplicity
we shall usually talk in terms of random quantities.)


Definition 4.6. (Statistic). Given random quantities (vectors) $x_1, \ldots, x_m$,
with specified sets of possible values $X_1, \ldots, X_m$, respectively, a random vector
$\boldsymbol{t}_m : X_1 \times \cdots \times X_m \to \mathbb{R}^{k(m)}$ $(k(m) \le m)$ is called a $k(m)$-dimensional
statistic.

A trivial case of such a statistic would be $\boldsymbol{t}_m(x_1, \ldots, x_m) = (x_1, \ldots, x_m)$,
but this clearly does not achieve much by way of summarisation, since $k(m) = m$.
Familiar examples of summary statistics are:

$\boldsymbol{t}_m = m^{-1}(x_1 + \cdots + x_m)$, the sample mean ($k(m) = 1$);
$\boldsymbol{t}_m = [m, (x_1 + \cdots + x_m), (x_1^2 + \cdots + x_m^2)]$, the sample size, total and sum
of squares ($k(m) = 3$);
$\boldsymbol{t}_m = [m, \mathrm{med}\{x_1, \ldots, x_m\}]$, the sample size and median ($k(m) = 2$);
$\boldsymbol{t}_m = \max\{x_1, \ldots, x_m\} - \min\{x_1, \ldots, x_m\}$, the sample range ($k(m) = 1$).
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To achieve data reduction, we clearly need k(m) < m: moreover, as with the 
above examples, further clarity of interpretation is achieved if k(m) = k, a fixed 
dimension independent of m. 
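As a small illustration, the four summary statistics listed above can be computed as follows. This is only a sketch: the function name and the dictionary keys are hypothetical labels introduced here for convenience.

```python
from statistics import median


def summary_statistics(xs):
    """Return the four example statistics t_m for a sample x_1, ..., x_m."""
    m = len(xs)
    total = sum(xs)
    sum_sq = sum(x * x for x in xs)
    return {
        "mean": total / m,                       # k(m) = 1
        "size_total_sumsq": (m, total, sum_sq),  # k(m) = 3
        "size_median": (m, median(xs)),          # k(m) = 2
        "range": max(xs) - min(xs),              # k(m) = 1
    }


stats = summary_statistics([2.0, 5.0, 1.0, 4.0])
print(stats)
```

In each case the dimension of the statistic is fixed, whatever the sample size $m$.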

In the next section, we shall examine the formal acceptability and implications 
of seeking to act as if particular summary statistics have a special status in the 
context of representing beliefs about a sequence of random vectors. We shall not 
concern ourselves at this stage with the origin of or motivation for any such choice 
of particular summary statistics. Instead, we shall focus attention on the general 
questions of whether, and under what circumstances, it is coherent to invoke such 
a form of data reduction and, if so, what forms of representation for predictive 
probability models might result. Throughout, we shall assume that beliefs can be 
represented in terms of density functions. 


4.5.2 Predictive Sufficiency and Parametric Sufficiency 


As an example of the way in which a summary statistic might be assumed to play
a special role in the evolution of beliefs, let us consider the following general
situation. Past observations $x_1, \ldots, x_m$ are available and an individual is
contemplating, conditional on this given information, beliefs about future observations
$x_{m+1}, \ldots, x_n$, to be described by $p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m)$. The following
definition describes one possible way in which assumptions of systematic data
reduction might be incorporated into the structure of such conditional beliefs.


Definition 4.7. (Predictive sufficiency). 

Given a sequence of random quantities $x_1, x_2, \ldots$, with probability measure $P$,
where $x_i$ takes values in $X_i$, $i = 1, 2, \ldots$, the sequence of statistics $t_1, t_2, \ldots$,
with $t_j$ defined on $X_1 \times \cdots \times X_j$, is said to be predictive sufficient for the
sequence $x_1, x_2, \ldots$ if, for all $m \ge 1$, $r \ge 1$ and $\{i_1, \ldots, i_r\} \cap \{1, \ldots, m\} = \emptyset$,

$$p(x_{i_1}, \ldots, x_{i_r} \mid x_1, \ldots, x_m) = p(x_{i_1}, \ldots, x_{i_r} \mid t_m),$$

where $p(\cdot \mid \cdot)$ is the conditional density induced by $P$.


The above definition captures the idea that, given $t_m = t_m(x_1, \ldots, x_m)$,
the individual values of $x_1, \ldots, x_m$ contribute nothing further to one’s evaluation
of probabilities of future events defined in terms of as yet unobserved random
quantities. Another way of expressing this, as is easily verified from Definition 4.7,
is that future observations $(x_{i_1}, \ldots, x_{i_r})$ and past observations $(x_1, \ldots, x_m)$ are
conditionally independent given $t_m$. Clearly, from a pragmatic point of view the
assumption of a specified sequence of predictive sufficient statistics will, in general,
greatly simplify the process of assessing probabilities of future events conditional on
past observations. From a formal point of view, however, we shall need additional
structure if we are to succeed in using this idea to identify specific forms of the
general representation of the joint distribution of $x_1, \ldots, x_m$.


As a particular illustration of what might be achieved, we shall assume in 
what follows that the probability measure P describing our beliefs implies both 
predictive sufficiency and exchangeability for the infinite sequence $x_1, x_2, \ldots$. As
with our earlier discussion in Section 4.4, a mathematically rigorous treatment is 
beyond the intended level of this book and so we shall confine ourselves to an 
informal presentation of the main ideas. 

In particular, throughout this section we shall assume that the exchangeability
assumption leads to a finitely parametrised mixture representation, as in
Corollary 1 to Proposition 4.3, so that, as shown in Corollary 2 to that proposition, the
conditional density function of $x_{m+1}, \ldots, x_n$, given $x_1, \ldots, x_m$, has the form

$$p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m) = \int \prod_{i=m+1}^{n} p(x_i \mid \theta) \, dQ(\theta \mid x_1, \ldots, x_m),$$

where

$$dQ(\theta \mid x_1, \ldots, x_m) = \frac{\prod_{i=1}^{m} p(x_i \mid \theta) \, dQ(\theta)}{\int \prod_{i=1}^{m} p(x_i \mid \theta) \, dQ(\theta)},$$

and all integrals, here and in what follows, are assumed to be over the set of possible
values of $\theta$.

This latter form makes clear that, for such exchangeable beliefs, the learning
process is “transmitted” within the mixture representation by the updating of beliefs
about the “unknown parameter” $\theta$. This suggests another possible way of defining
a statistic $t_m = t_m(x_1, \ldots, x_m)$ to be a “sufficient summary” of $x_1, \ldots, x_m$.


Definition 4.8. (Parametric sufficiency). If $x_1, x_2, \ldots$ is an infinitely
exchangeable sequence of random quantities, where $x_i$ takes values in $X_i = X$,
$i = 1, 2, \ldots$, the sequence of statistics $t_1, t_2, \ldots$, with $t_j$ defined on
$X_1 \times \cdots \times X_j$, is said to be parametric sufficient for $x_1, x_2, \ldots$ if, for any $n \ge 1$,

$$dQ(\theta \mid x_1, \ldots, x_n) = dQ(\theta \mid t_n),$$

for any $dQ(\theta)$ defining an exchangeable predictive probability model via the
representation

$$p(x_1, \ldots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \theta) \, dQ(\theta).$$


Definitions 4.7 and 4.8 both seem intuitively compelling as encapsulations 
of the notion of a statistic being a “sufficient summary”. It is perhaps reassuring 
therefore that, within our assumed framework, we can establish the following.


Proposition 4.9. (Equivalence of predictive and parametric sufficiencies).
Given an infinitely exchangeable sequence of random quantities $x_1, x_2, \ldots$,
where $x_i$ takes values in $X_i = X$, $i = 1, 2, \ldots$, the sequence of statistics
$t_1, t_2, \ldots$, with $t_j$ defined on $X_1 \times \cdots \times X_j$, is predictive sufficient if, and only
if, it is parametric sufficient.


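Heuristic proof. For any $x_1, \ldots, x_m, x_{m+1}, \ldots, x_n$ and any sequence of
statistics $t_m$, where $t_m = t_m(x_1, \ldots, x_m)$, $m = 1, \ldots, n - 1$, the representation theorem
implies that

$$p(x_{m+1}, \ldots, x_n \mid t_m) = \frac{1}{p(t_m)} \, p(x_{m+1}, \ldots, x_n, t_m)
= \frac{1}{p(t_m)} \int_A p(x_1, \ldots, x_m, x_{m+1}, \ldots, x_n) \, dx_1 \cdots dx_m,$$

where $A = \{(x_1, \ldots, x_m) : t_m(x_1, \ldots, x_m) = t_m\}$, which, in turn, can be easily
shown to be expressible as

$$\frac{1}{p(t_m)} \int \left[ \int_A \prod_{i=1}^{m} p(x_i \mid \theta) \, dx_1 \cdots dx_m \right] \prod_{i=m+1}^{n} p(x_i \mid \theta) \, dQ(\theta)$$

$$= \frac{1}{p(t_m)} \int \prod_{i=m+1}^{n} p(x_i \mid \theta) \, p(t_m \mid \theta) \, dQ(\theta)
= \int \prod_{i=m+1}^{n} p(x_i \mid \theta) \, dQ(\theta \mid t_m),$$

where

$$dQ(\theta \mid t_m) = \frac{p(t_m \mid \theta) \, dQ(\theta)}{p(t_m)} = \frac{p(t_m \mid \theta) \, dQ(\theta)}{\int p(t_m \mid \theta) \, dQ(\theta)}.$$

It follows that

$$p(x_{m+1}, \ldots, x_n \mid t_m) = p(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m)
= \int \prod_{i=m+1}^{n} p(x_i \mid \theta) \, dQ(\theta \mid x_1, \ldots, x_m),$$

if, and only if, $dQ(\theta \mid x_1, \ldots, x_m) = dQ(\theta \mid t_m)$ for all $dQ(\theta)$.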


To make further progress, we now establish that parametric sufficiency is itself 
equivalent to certain further conditions on the probability structure. 


Proposition 4.10. (Neyman factorisation criterion). The sequence $t_1, t_2, \ldots$
is parametric sufficient for infinitely exchangeable $x_1, x_2, \ldots$ admitting a
finitely parametrised mixture representation if and only if, for any $m \ge 1$,
the joint density for $x_1, \ldots, x_m$ given $\theta$ has the form

$$p(x_1, \ldots, x_m \mid \theta) = h_m(t_m, \theta) \, g(x_1, \ldots, x_m),$$

for some functions $h_m \ge 0$, $g > 0$.

Outline proof. Given such a factorisation, for any $dQ(\theta)$ we have

$$dQ(\theta \mid x_1, \ldots, x_m) = \frac{p(x_1, \ldots, x_m \mid \theta) \, dQ(\theta)}{\int p(x_1, \ldots, x_m \mid \theta) \, dQ(\theta)}
= \frac{h_m(t_m, \theta) \, dQ(\theta)}{\int h_m(t_m, \theta) \, dQ(\theta)}$$

for some $h_m \ge 0$. The right-hand side depends on $x_1, \ldots, x_m$ only through $t_m$
and, hence, $dQ(\theta \mid x_1, \ldots, x_m) = dQ(\theta \mid t_m)$. Conversely, given parametric
sufficiency, we have, for any $dQ(\theta)$ with support $\Theta$,

$$\frac{p(x_1, \ldots, x_m \mid \theta) \, dQ(\theta)}{p(x_1, \ldots, x_m)} = dQ(\theta \mid x_1, \ldots, x_m)
= dQ(\theta \mid t_m) = \frac{p(t_m \mid \theta) \, dQ(\theta)}{p(t_m)},$$

so that

$$p(x_1, \ldots, x_m \mid \theta) = h_m(t_m, \theta) \, g(x_1, \ldots, x_m)$$

for some $h_m \ge 0$, $g > 0$ as required.


Proposition 4.11. (Sufficiency and conditional independence).
The sequence $t_1, t_2, \ldots$ is parametric sufficient for infinitely exchangeable
$x_1, x_2, \ldots$ if, and only if, for any $m \ge 1$, the density $p(x_1, \ldots, x_m \mid \theta, t_m)$ is
independent of $\theta$.

Outline proof. For any $t_m = t_m(x_1, \ldots, x_m)$ we have

$$p(x_1, \ldots, x_m \mid \theta) = p(x_1, \ldots, x_m \mid \theta, t_m) \, p(t_m \mid \theta).$$

If $p(x_1, \ldots, x_m \mid \theta, t_m)$ is independent of $\theta$, the parametric sufficiency of $t_1, t_2, \ldots$
follows immediately from Proposition 4.10.

Conversely, suppose that $t_1, t_2, \ldots$ is parametric sufficient, so that, by
Proposition 4.10,

$$p(x_1, \ldots, x_m \mid \theta) = h_m(t_m, \theta) \, g(x_1, \ldots, x_m)$$

for some $h_m \ge 0$, $g > 0$. Integrating over all values $\{x_1, \ldots, x_m\}$ such that
$t_m(x_1, \ldots, x_m) = t_m$, we obtain

$$p(t_m \mid \theta) = h_m(t_m, \theta) \, G(t_m)$$

for some $G > 0$. Substituting for $h_m(t_m, \theta)$ in the expression for $p(x_1, \ldots, x_m \mid \theta)$,
we obtain

$$p(x_1, \ldots, x_m \mid \theta) = p(t_m \mid \theta) \, \frac{g(x_1, \ldots, x_m)}{G(t_m)},$$

so that

$$p(x_1, \ldots, x_m \mid \theta, t_m) = \frac{g(x_1, \ldots, x_m)}{G(t_m)},$$

which is independent of $\theta$.

In the approach we have adopted, the definitions and consequences of
predictive and parametric sufficiency have been motivated and examined within the
general framework of seeking to find coherent representations of subjective be- 
liefs about sequences of observables. Thus, for example, the notion of parametric 
sufficiency has so far only been put forward within the context of exchangeable 
beliefs, where the operational significance of “parameter” typically becomes clear 
from the relevant representation theorem. 

In fact, however, as the reader familiar with more “conventional” approaches 
will have already realised, related concepts of “sufficiency” are also central to non- 
subjectivist theories. In particular, we note that the non-dependence of the density 
$p(x_1, \ldots, x_m \mid \theta, t_m)$ on $\theta$, established here in Proposition 4.11 as a consequence
of our definitions, was itself put forward as the definition of a “sufficient statistic” 
by Fisher (1922), and the factorisation given in Proposition 4.10 was established 
by Neyman (1935) as equivalent to the Fisher definition. 

From an operational, subjectivist point of view, it seems to us rather myste- 
rious to launch into fundamental definitions about learning processes expressed 
in terms of conditioning on “parameters” having no status other than as “labels”. 
However, from a technical point of view, since our representation for exchange- 
able sequences provides, for us, a justification for regarding the usual (Fisher) 
definition as equivalent to predictive and parametric sufficiency, we can exploit 
many of the important mathematical results which have been established using 
that definition as a starting point. 


In the context of our subjectivist discussion of beliefs and models, we shall 
mainly be interested in asking the following questions. 

When is it coherent to act as if there is a sequence of predictive sufficient 
statistics associated with an exchangeable sequence of random quantities? 

What forms of predictive probability model are implied in cases where we can 
assume a sequence of predictive sufficient statistics? 

Aside from these foundational and modelling questions, however, the results 
given above also enable us to check the form of the predictive sufficient statistics 
for any given exchangeable representation. We shall illustrate this possibility with 
some simple examples before continuing with the general development. 


Example 4.5. (Bernoulli model). We recall from Proposition 4.1 that if $x_1, x_2, \ldots$
is an infinitely exchangeable sequence of 0-1 random quantities, then we have the general
representation

$$p(x_1, \ldots, x_n) = \int_0^1 p(x_1, \ldots, x_n \mid \theta) \, dQ(\theta)
= \int_0^1 \prod_{i=1}^{n} \mathrm{Br}(x_i \mid \theta) \, dQ(\theta)
= \int_0^1 \theta^{s_n} (1 - \theta)^{n - s_n} \, dQ(\theta),$$

where $s_n = x_1 + \cdots + x_n$. Defining $t_n = [n, s_n]$ and noting that we can write

$$p(x_1, \ldots, x_n \mid \theta) = h_n(t_n, \theta) \, g(x_1, \ldots, x_n),$$

with

$$h_n(t_n, \theta) = \theta^{s_n} (1 - \theta)^{n - s_n}, \qquad g(x_1, \ldots, x_n) = 1,$$

it follows from Propositions 4.9 and 4.10 that the sequence $t_1, t_2, \ldots$ is predictive and
parametric sufficient for $x_1, x_2, \ldots$. This corresponds precisely to the intuitive idea that the
sequence length and total number of 1's summarises all the interesting information in any
sequence of observed exchangeable 0-1 random quantities. 
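The parametric sufficiency of $t_n = [n, s_n]$ can be illustrated numerically. The sketch below assumes, purely for illustration, a discrete $dQ(\theta)$ placing equal mass on five grid points; the grid, the prior weights and the function names are not taken from the text. Two different 0-1 sequences with the same $[n, s_n]$ lead to the same updated $dQ(\theta \mid x_1, \ldots, x_n)$.

```python
def posterior_from_sequence(xs, grid, prior):
    """dQ(theta | x_1..x_n) on a finite grid, using the full 0-1 sequence."""
    weights = []
    for theta, w in zip(grid, prior):
        likelihood = 1.0
        for x in xs:
            likelihood *= theta if x == 1 else (1.0 - theta)
        weights.append(likelihood * w)
    total = sum(weights)
    return [w / total for w in weights]


def posterior_from_statistic(n, s, grid, prior):
    """dQ(theta | t_n) on the same grid, using only t_n = [n, s_n]."""
    weights = [theta ** s * (1.0 - theta) ** (n - s) * w
               for theta, w in zip(grid, prior)]
    total = sum(weights)
    return [w / total for w in weights]


grid = [0.1, 0.3, 0.5, 0.7, 0.9]   # illustrative support for theta
prior = [0.2] * 5                  # illustrative dQ(theta)

post_a = posterior_from_sequence([1, 0, 0, 1, 1], grid, prior)
post_b = posterior_from_sequence([0, 1, 1, 1, 0], grid, prior)
post_t = posterior_from_statistic(5, 3, grid, prior)
print(post_a, post_b, post_t)
```

All three weight vectors coincide up to rounding, since the likelihood depends on the data only through $h_n(t_n, \theta)$.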


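Example 4.6. (Normal model). We recall from Proposition 4.5 that if $x_1, x_2, \ldots$ is
an exchangeable sequence of real-valued random quantities with the additional property of
centred spherical symmetry then we have the general representation

$$p(x_1, \ldots, x_n) = \int_{\Re \times \Re^+} p(x_1, \ldots, x_n \mid \mu, \lambda) \, dQ(\mu, \lambda)$$

$$= \int_{\Re \times \Re^+} \prod_{i=1}^{n} \mathrm{N}(x_i \mid \mu, \lambda) \, dQ(\mu, \lambda)$$

$$= \int_{\Re \times \Re^+} \prod_{i=1}^{n} \left( \frac{\lambda}{2\pi} \right)^{1/2} \exp\left\{ -\frac{\lambda}{2} (x_i - \mu)^2 \right\} dQ(\mu, \lambda)$$

$$= \int_{\Re \times \Re^+} \left( \frac{\lambda}{2\pi} \right)^{n/2} \exp\left\{ -\frac{\lambda}{2} \left[ n(\bar{x}_n - \mu)^2 + n s_n^2 \right] \right\} dQ(\mu, \lambda),$$

where

$$\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad s_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2.$$

In the light of Propositions 4.10 and 4.11, inspection of $p(x_1, \ldots, x_n \mid \mu, \lambda)$
reveals that

$$t_n = (n, \bar{x}_n, s_n^2)$$

defines a sequence of predictive and parametric sufficient statistics for $x_1, x_2, \ldots$. In view
of the centring and spherical symmetry conditions, it is perhaps not surprising that the
sample size, mean and sample mean sum of squares about the mean turn out to be sufficient
summaries. Of course, $t_n$ is not unique; for example, since

$$n s_n^2 = x_1^2 + \cdots + x_n^2 - n \bar{x}_n^2,$$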
we could equally well define $t_n = [n, \bar{x}_n, n^{-1}(x_1^2 + \cdots + x_n^2)]$ as the sequence of sufficient
statistics.


Example 4.7. (Exponential model). We recall from Proposition 4.7 that if $x_1, x_2, \ldots$
is an exchangeable sequence of positive real-valued random quantities with an additional
“origin invariance” property, then we have the general representation

$$p(x_1, \ldots, x_n) = \int_0^\infty p(x_1, \ldots, x_n \mid \theta) \, dQ(\theta)
= \int_0^\infty \prod_{i=1}^{n} \mathrm{Ex}(x_i \mid \theta) \, dQ(\theta)
= \int_0^\infty \theta^{n} \exp(-\theta s_n) \, dQ(\theta),$$

where $s_n = x_1 + \cdots + x_n$. Again, it is immediate from Propositions 4.10 and 4.11 that
$t_n = [n, s_n]$ defines a sequence of predictive and parametric sufficient statistics, although,
in this example, there is not such an obvious link between the form of invariance assumed 
and the form of the sufficient statistic. 


It is clear from the general definition of a sufficient statistic (parametric or 
predictive) that $t_n(x_1, \ldots, x_n) = [n, (x_1, \ldots, x_n)]$ is always a sufficient statistic.
However, given our interest in achieving simplification through data reduction, it is 
equally clear that we should like to focus on sufficient statistics which are, in some 
sense, minimal. This motivates the following definition. 


Definition 4.9. (Minimal sufficient statistic). If $x_1, x_2, \ldots$ is an infinitely
exchangeable sequence of random quantities, where $x_i$ takes values in $X_i = X$,
the sequence of statistics $t_1, t_2, \ldots$, with $t_j$ defined on $X_1 \times \cdots \times X_j$, is
minimal sufficient for $x_1, x_2, \ldots$ if, given any other sequence of sufficient
statistics, $s_1, s_2, \ldots$, there exist functions $g_1(\cdot), g_2(\cdot), \ldots$ such that $t_i = g_i(s_i)$,
$i = 1, 2, \ldots$


It is easily seen that the forms of $t_n$ identified in Examples 4.5 to 4.7 are
minimal sufficient statistics. From now on, references to sufficient statistics should 
be interpreted as intending minimal sufficient statistics. 

Finally, since n very often appears as part of the sufficient statistic, we shall 
sometimes, to avoid tedious repetition, omit explicit mention of n and refer to the 
“interesting function(s) of $x_1, \ldots, x_n$” as the sufficient statistic.


4.5.3 Sufficiency and the Exponential Family 


In the previous section, we identified some further potential structure in the general 
representation of joint densities for exchangeable random quantities when predic- 
tive sufficiency is assumed. We shall now take this process a stage further by 
examining in detail representations relating to sufficient statistics of fixed
dimension.

Since we have established, in the finite parameter framework, the equivalence
of predictive and parametric sufficiency for the case of exchangeable random
quantities, and their equivalence with the factorisation criterion of Proposition 4.10, we
shall from now on simply use the term sufficient statistic, without risk of confusion.

We begin by considering exchangeable beliefs constructed by mixing, with 
respect to some $dQ(\theta)$, over a specified parametric form

$$p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta), \qquad (x_1, \ldots, x_n) \in X^n \subseteq \Re^n,$$

where $\theta$ is a one-dimensional parameter. By Proposition 4.10, if the form of $p(x \mid \theta)$
is such that $p(x_1, \ldots, x_n \mid \theta)$ factors into $h_n(t_n, \theta) g(x_1, \ldots, x_n)$, for some $h_n$, $g$,
the statistic $t_n = t_n(x_1, \ldots, x_n)$ would be sufficient. An important class of such
$p(x \mid \theta)$ is identified in the following definition.


Definition 4.10. (One-parameter exponential family). A probability density
(or mass function) $p(x \mid \theta)$, labelled by $\theta \in \Theta \subseteq \Re$, is said to belong to the
one-parameter exponential family if it is of the form

$$p(x \mid \theta) = \mathrm{Ef}(x \mid f, g, h, \phi, \theta, c) = f(x) \, g(\theta) \exp\{c \, \phi(\theta) \, h(x)\}, \qquad x \in X,$$

where, given $f$, $h$, $\phi$, and $c$,

$$[g(\theta)]^{-1} = \int_X f(x) \exp\{c \, \phi(\theta) \, h(x)\} \, dx < \infty.$$

The family is called regular if $X$ does not depend on $\theta$; otherwise it is called
non-regular.


Proposition 4.12. (Sufficient statistics for the one-parameter exponential
family). If $x_1, x_2, \ldots$, $x_i \in X$, is an exchangeable sequence such that, given
regular $\mathrm{Ef}(\cdot \mid \cdot)$,

$$p(x_1, \ldots, x_n) = \int_\Theta \prod_{i=1}^{n} \mathrm{Ef}(x_i \mid f, g, h, \phi, \theta, c) \, dQ(\theta),$$

for some $dQ(\theta)$, then $t_n = t_n(x_1, \ldots, x_n) = [n, h(x_1) + \cdots + h(x_n)]$, for
$n = 1, 2, \ldots$, is a sequence of sufficient statistics.

Proof. This follows immediately from Proposition 4.10 on noting that

$$\prod_{i=1}^{n} \mathrm{Ef}(x_i \mid f, g, h, \phi, \theta, c) = \left[ \prod_{i=1}^{n} f(x_i) \right] [g(\theta)]^{n} \exp\left\{ c \, \phi(\theta) \sum_{i=1}^{n} h(x_i) \right\}.$$

The following standard univariate probability distributions are particular cases
of the (regular) one-parameter exponential family with the appropriate choices of
$f$, $g$, etc. as indicated.


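Bernoulli

$$p(x \mid \theta) = \mathrm{Br}(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}, \qquad x \in \{0, 1\}, \quad \theta \in [0, 1].$$

$$f(x) = 1, \quad g(\theta) = 1 - \theta, \quad h(x) = x, \quad \phi(\theta) = \log\left( \frac{\theta}{1 - \theta} \right), \quad c = 1.$$


Poisson

$$p(x \mid \theta) = \mathrm{Po}(x \mid \theta) = \frac{\theta^{x} e^{-\theta}}{x!}, \qquad x \in \{0, 1, 2, \ldots\}, \quad \theta \in \Re^+.$$

$$f(x) = (x!)^{-1}, \quad g(\theta) = e^{-\theta}, \quad h(x) = x, \quad \phi(\theta) = \log \theta, \quad c = 1.$$


Exponential

$$p(x \mid \theta) = \mathrm{Ex}(x \mid \theta) = \theta e^{-\theta x}, \qquad x \in \Re^+, \quad \theta \in \Re^+.$$

$$f(x) = 1, \quad g(\theta) = \theta, \quad h(x) = x, \quad \phi(\theta) = \theta, \quad c = -1.$$


Normal (variance unknown)

$$p(x \mid \theta) = \mathrm{N}(x \mid 0, \theta) = \left( \frac{\theta}{2\pi} \right)^{1/2} \exp\left[ -\tfrac{1}{2} \theta x^2 \right], \qquad x \in \Re, \quad \theta \in \Re^+.$$

$$f(x) = (2\pi)^{-1/2}, \quad g(\theta) = \theta^{1/2}, \quad h(x) = x^2, \quad \phi(\theta) = \theta, \quad c = -1/2.$$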
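The four identifications above can be checked mechanically by evaluating $f(x) g(\theta) \exp\{c \phi(\theta) h(x)\}$ and comparing it with the familiar density. The following is a minimal sketch; the dictionary layout and the particular test points are illustrative assumptions, not part of the text.

```python
import math

# (f, g, h, phi, c) for each family, transcribed from the list above.
FAMILIES = {
    "bernoulli": (lambda x: 1.0, lambda t: 1.0 - t, lambda x: x,
                  lambda t: math.log(t / (1.0 - t)), 1.0),
    "poisson": (lambda x: 1.0 / math.factorial(x), lambda t: math.exp(-t),
                lambda x: x, lambda t: math.log(t), 1.0),
    "exponential": (lambda x: 1.0, lambda t: t, lambda x: x,
                    lambda t: t, -1.0),
    "normal": (lambda x: (2.0 * math.pi) ** -0.5, lambda t: t ** 0.5,
               lambda x: x * x, lambda t: t, -0.5),
}

# The same densities written in their familiar form.
DENSITIES = {
    "bernoulli": lambda x, t: t ** x * (1.0 - t) ** (1 - x),
    "poisson": lambda x, t: t ** x * math.exp(-t) / math.factorial(x),
    "exponential": lambda x, t: t * math.exp(-t * x),
    "normal": lambda x, t: math.sqrt(t / (2.0 * math.pi)) * math.exp(-0.5 * t * x * x),
}


def ef(x, theta, f, g, h, phi, c):
    """One-parameter exponential family form f(x) g(theta) exp{c phi(theta) h(x)}."""
    return f(x) * g(theta) * math.exp(c * phi(theta) * h(x))


# Illustrative (x, theta) pairs at which to compare the two forms.
CASES = {"bernoulli": (1, 0.3), "poisson": (4, 2.5),
         "exponential": (1.7, 0.8), "normal": (-0.6, 2.0)}

errors = {name: abs(ef(x, t, *FAMILIES[name]) - DENSITIES[name](x, t))
          for name, (x, t) in CASES.items()}
print(errors)
```

Each discrepancy is zero up to rounding error, confirming that the listed choices of $f$, $g$, $h$, $\phi$ and $c$ reproduce the stated densities at these points.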


We note that the term $c\phi(\theta)$ appearing in the general $\mathrm{Ef}(\cdot \mid \cdot)$ form could always
be simply written as $\phi^*(\theta)$ with $\phi^*$ suitably defined (see, also, Definition 4.11).
However, it is often convenient to be able to separate the “interesting” function of
$\theta$, $\phi(\theta)$, from the constant which happens to multiply it.

In Definition 4.10, we allowed for the possibility (the non-regular case) that
the range, $X$, of possible values of $x$ might itself depend on the labelling parameter
$\theta$. Although we have not yet made a connection between this case and forms of
representation arising in the modelling of exchangeable sequences, it will be useful 
at this stage to note examples of the well-known forms of distribution which are 
covered by this definition. We shall indicate later how the use of such forms in the 
modelling process might be given a subjectivist justification. 


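Uniform

$$p(x \mid \theta) = \mathrm{U}(x \mid 0, \theta) = \theta^{-1}, \qquad x \in (0, \theta), \quad \theta \in \Re^+.$$

$$f(x) = 1, \quad g(\theta) = \theta^{-1}, \quad h(x) = 0, \quad \phi(\theta) = \theta, \quad c = 1.$$


Shifted exponential

$$p(x \mid \theta) = \mathrm{Shex}(x \mid \theta) = \exp[-(x - \theta)], \qquad x - \theta \in \Re^+, \quad \theta \in \Re^+.$$

$$f(x) = e^{-x}, \quad g(\theta) = e^{\theta}, \quad h(x) = 0, \quad \phi(\theta) = \theta, \quad c = 1.$$

In order to identify sequences of sufficient statistics in these and similar cases,
we make use of the factorisation criterion given in Proposition 4.10.

For the uniform, we rewrite the density in the form

$$p(x \mid \theta) = \theta^{-1} I_{(0, \theta)}(x), \qquad x \in \Re,$$

so that, for any sequence $x_1, \ldots, x_n$ which is conditionally independent given $\theta$,

$$p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)
= \theta^{-n} I_{(0, \theta)}\left( \max_{i=1,\ldots,n} \{x_i\} \right), \qquad (x_1, \ldots, x_n) \in (\Re^+)^n.$$

It then follows immediately from Proposition 4.10 that

$$t_n = t_n(x_1, \ldots, x_n) = \left[ n, \max_{i=1,\ldots,n} \{x_i\} \right], \qquad n = 1, 2, \ldots$$

is a sequence of sufficient statistics in this case.

For the shifted exponential, if we rewrite the density in the form

$$p(x \mid \theta) = \exp[-(x - \theta)] \, I_{(\theta, \infty)}(x), \qquad x \in \Re,$$

the product $\prod_{i=1}^{n} p(x_i \mid \theta)$ equals $\exp(n\theta) \, I_{(\theta, \infty)}(\min_i \{x_i\}) \exp(-\sum_{i=1}^{n} x_i)$,
so that, by Proposition 4.10, $t_n = [n, \min_{i=1,\ldots,n} \{x_i\}]$, $n = 1, 2, \ldots$,
provides a sequence of sufficient statistics.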
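For the uniform case, the factorisation can be illustrated by comparing the likelihood computed from a full sample with the likelihood computed from $[n, \max\{x_i\}]$ alone. This is a sketch under the assumption that all observations are positive; the sample values and function names are illustrative.

```python
def uniform_likelihood(xs, theta):
    """Product of U(x_i | 0, theta) densities for a positive sample xs."""
    if any(x <= 0.0 or x >= theta for x in xs):
        return 0.0
    return theta ** (-len(xs))


def uniform_likelihood_from_statistic(n, x_max, theta):
    """The same likelihood, using only t_n = [n, max{x_i}]."""
    return theta ** (-n) if x_max < theta else 0.0


sample_a = [0.5, 2.0, 1.2]
sample_b = [2.0, 0.1, 1.9]      # different data, same n and same maximum
thetas = [1.5, 2.0, 2.5, 4.0]

lik_a = [uniform_likelihood(sample_a, t) for t in thetas]
lik_b = [uniform_likelihood(sample_b, t) for t in thetas]
lik_t = [uniform_likelihood_from_statistic(3, 2.0, t) for t in thetas]
print(lik_a, lik_b, lik_t)
```

The three likelihood functions coincide at every value of $\theta$ tried, being zero whenever $\theta$ does not exceed the sample maximum.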


The above discussion readily generalises to the case of exchangeable
sequences generated by mixing over specified parametric forms involving a
$k$-dimensional parameter $\theta$.


Definition 4.11. (k-parameter exponential family). A probability density (or
mass function) $p(x \mid \theta)$, $x \in X$, which is labelled by $\theta \in \Theta \subseteq \Re^k$, is said to
belong to the k-parameter exponential family if it is of the form

$$p(x \mid \theta) = \mathrm{Ef}_k(x \mid f, g, h, \phi, \theta, c) = f(x) \, g(\theta) \exp\left\{ \sum_{i=1}^{k} c_i \, \phi_i(\theta) \, h_i(x) \right\},$$

where $h = (h_1, \ldots, h_k)$, $\phi(\theta) = (\phi_1, \ldots, \phi_k)$ and, given the functions
$f$, $h$, $\phi$, and the constants $c_i$,

$$[g(\theta)]^{-1} = \int_X f(x) \exp\left\{ \sum_{i=1}^{k} c_i \, \phi_i(\theta) \, h_i(x) \right\} dx < \infty.$$

The family is called regular if $X$ does not depend on $\theta$; otherwise it is called
non-regular.


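Proposition 4.13. (Sufficient statistics for the k-parameter exponential
family). If $x_1, x_2, \ldots$, $x_i \in X$, is an exchangeable sequence such that, given
regular k-parameter $\mathrm{Ef}_k(\cdot \mid \cdot)$,

$$p(x_1, \ldots, x_n) = \int_\Theta \prod_{i=1}^{n} \mathrm{Ef}_k(x_i \mid f, g, h, \phi, \theta, c) \, dQ(\theta),$$

for some $dQ(\theta)$, then

$$t_n = t_n(x_1, \ldots, x_n) = \left[ n, \sum_{i=1}^{n} h_1(x_i), \ldots, \sum_{i=1}^{n} h_k(x_i) \right], \qquad n = 1, 2, \ldots$$

is a sequence of sufficient statistics.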


Proof. This is analogous to Proposition 4.12 and is a straightforward conse- 
quence of Proposition 4.10. g 


The following standard probability distributions are particular cases (the first 
regular, the second non-regular) of the k-parameter exponential family with the 
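appropriate choices of $f$, $g$ etc. as indicated.


Normal (unknown mean and variance)

$$p(x \mid \theta) = p(x \mid \mu, \tau) = \mathrm{N}(x \mid \mu, \tau)
= \left( \frac{\tau}{2\pi} \right)^{1/2} \exp\left[ -\frac{\tau}{2} (x - \mu)^2 \right], \qquad x \in \Re, \quad \mu \in \Re, \quad \tau \in \Re^+.$$

In this case, $k = 2$ and

$$f(x) = (2\pi)^{-1/2}, \quad g(\theta) = \tau^{1/2} \exp\left[ -\tfrac{1}{2} \tau \mu^2 \right], \quad h(x) = (x, x^2),$$

$$\phi(\theta) = (\tau\mu, \tau), \quad c_1 = 1, \quad c_2 = -1/2,$$

so that $t_n = [n, \sum_{i=1}^{n} x_i, \sum_{i=1}^{n} x_i^2]$, $n = 1, 2, \ldots$, is a sequence of sufficient
statistics.

Uniform (over the interval $(\theta_1, \theta_2)$)

$$p(x \mid \theta) = p(x \mid \theta_1, \theta_2) = \mathrm{U}(x \mid \theta_1, \theta_2) = (\theta_2 - \theta_1)^{-1},
\qquad x \in (\theta_1, \theta_2), \quad \theta_1 \in \Re, \quad \theta_2 - \theta_1 \in \Re^+.$$

In this case,

$$f(x) = 1, \quad g(\theta) = (\theta_2 - \theta_1)^{-1}, \quad h(x) = 0, \quad \phi(\theta) = (\theta_1, \theta_2), \quad c_1 = c_2 = 0,$$

and

$$t_n = [n, \min\{x_1, \ldots, x_n\}, \max\{x_1, \ldots, x_n\}], \qquad n = 1, 2, \ldots$$

is easily seen to give a sequence of sufficient statistics.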
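For the normal case, the role of $t_n = [n, \sum x_i, \sum x_i^2]$ can be seen by taking two different samples that share the same value of $t_n$ and comparing their log-likelihoods. The sketch below uses two small illustrative samples chosen for this purpose; the function names are hypothetical.

```python
import math


def normal_loglik(xs, mu, tau):
    """Log of the product of N(x_i | mu, tau) densities (tau is the precision)."""
    n = len(xs)
    return (0.5 * n * math.log(tau / (2.0 * math.pi))
            - 0.5 * tau * sum((x - mu) ** 2 for x in xs))


def t_n(xs):
    """Sufficient statistic [n, sum of x_i, sum of x_i squared]."""
    return (len(xs), sum(xs), sum(x * x for x in xs))


sample_a = [0.0, 3.0, 3.0]
sample_b = [1.0, 1.0, 4.0]   # different values, same n, total and sum of squares

params = [(0.0, 1.0), (1.5, 0.5), (-2.0, 3.0)]   # illustrative (mu, tau) pairs
gaps = [abs(normal_loglik(sample_a, mu, tau) - normal_loglik(sample_b, mu, tau))
        for mu, tau in params]
print(t_n(sample_a), t_n(sample_b), gaps)
```

Since the two samples have identical $t_n$, their likelihoods agree for every $(\mu, \tau)$, so they would lead to identical updating of $dQ(\mu, \tau)$.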


The description of the exponential family forms given in Definitions 4.10 and 
4.11 is convenient for some purposes (relating straightforwardly to familiar
versions of parametric families), but somewhat cumbersome for others. This motivates
the following definition, which we give for the general k-parameter case.


Definition 4.12. (Canonical exponential family).
The probability density (or mass function)

$$p(y \mid \psi) = \mathrm{Cef}(y \mid a, b, \psi) = a(y) \exp\{ y^t \psi - b(\psi) \}, \qquad y \in Y,$$

derived from $\mathrm{Ef}_k(\cdot \mid \cdot)$ in Definition 4.11, via the transformations

$$y = (y_1, \ldots, y_k), \qquad \psi = (\psi_1, \ldots, \psi_k),$$

$$y_i = h_i(x), \qquad \psi_i = c_i \, \phi_i(\theta), \qquad i = 1, \ldots, k,$$

is called the canonical form of representation of the exponential family.

Systematic use of this canonical form to clarify the nature of the Bayesian
learning process will be presented in Section 5.2.2. Here, we shall use it to
examine briefly the nature and interpretation of the function $b(\psi)$, and to identify the
distribution of sums of independent Cef random quantities.


Proposition 4.14. (First two moments of the canonical exponential family).
For $y$ in Definition 4.12,

$$E(y \mid \psi) = \nabla b(\psi), \qquad V(y \mid \psi) = \nabla^2 b(\psi).$$

Proof. It is easy to verify that the characteristic function of $y$ conditional on
$\psi$ is given by

$$E(\exp\{ i u^t y \} \mid \psi) = \exp\{ b(iu + \psi) - b(\psi) \},$$

from which the result follows straightforwardly.
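Proposition 4.14 can be checked numerically in a one-dimensional case. The sketch below uses the Poisson distribution in canonical form, $a(y) = 1/y!$ and $b(\psi) = e^{\psi}$, and compares the mean and variance computed from the mass function with finite-difference derivatives of $b$. The truncation point of the sums and the step size are assumptions of this illustration.

```python
import math


def a(y: int) -> float:
    return 1.0 / math.factorial(y)


def b(psi: float) -> float:
    return math.exp(psi)


def moments(psi: float, upper: int = 100):
    """Mean and variance of Cef(y | a, b, psi), summing over y = 0..upper-1."""
    probs = [a(y) * math.exp(y * psi - b(psi)) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var


def derivatives(psi: float, eps: float = 1e-4):
    """Central finite-difference approximations to b'(psi) and b''(psi)."""
    first = (b(psi + eps) - b(psi - eps)) / (2.0 * eps)
    second = (b(psi + eps) - 2.0 * b(psi) + b(psi - eps)) / eps ** 2
    return first, second


psi = 0.7
mean, var = moments(psi)
d1, d2 = derivatives(psi)
print(mean, d1, var, d2)
```

Here both derivatives equal $e^{\psi}$, matching the familiar fact that a Poisson random quantity has equal mean and variance.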


Proposition 4.15. (Sufficiency in the canonical exponential family).
If $y_1, \ldots, y_n$ are independent $\mathrm{Cef}(y \mid a, b, \psi)$ random quantities, then

$$s = \sum_{i=1}^{n} y_i$$

is a sufficient statistic and has a distribution $\mathrm{Cef}(s \mid a^{(n)}, nb, \psi)$, where $a^{(n)}$
is the n-fold convolution of $a$.

Proof. Sufficiency is immediate from Proposition 4.12. We see immediately
that the characteristic function of $s$ is $\exp\{ nb(iu + \psi) - nb(\psi) \}$, so that the
distribution of $s$ is as claimed, where $a^{(n)}$ satisfies

$$nb(\psi) = \log \int a^{(n)}(s) \exp\{ \psi^t s \} \, ds.$$

Examination of the density convolution form for $n = 1$, plus induction, establishes
the form of $a^{(n)}$.
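To make the n-fold convolution concrete, consider the Poisson case, where $a(y) = 1/y!$ on the non-negative integers. The sketch below computes the threefold convolution numerically and compares it with $n^s / s!$, the closed form for this particular choice of $a$. The helper names and the truncation length are illustrative.

```python
import math


def convolve(u, v):
    """Discrete convolution of two sequences, truncated to len(u) terms.

    Truncation is harmless here: entry s of the result only involves
    entries 0..s of the inputs.
    """
    out = [0.0] * len(u)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            if i + j < len(out):
                out[i + j] += ui * vj
    return out


def n_fold(a, n):
    """n-fold convolution of the sequence a with itself."""
    result = a
    for _ in range(n - 1):
        result = convolve(result, a)
    return result


size = 12
a = [1.0 / math.factorial(y) for y in range(size)]
a3 = n_fold(a, 3)
closed_form = [3.0 ** s / math.factorial(s) for s in range(size)]
print(a3[:5], closed_form[:5])
```

With this $a^{(n)}$, the density $a^{(n)}(s) \exp\{\psi s - n e^{\psi}\}$ is the Poisson mass function with mean $n e^{\psi}$, as the proposition requires.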


Our discussion thus far has considered the situation where exchangeable belief 
distributions are constructed by assuming a mixing over finite-parameter
exponential family forms. A consequence is that sufficient statistics of fixed dimension
exist. Moreover, classical results of Darmois (1936), Koopman (1936), Pitman
(1936), Hipp (1974) and Huzurbazar (1976) establish, under various regularity
conditions, that the exponential family is the only family of distributions for which
such sufficient statistics exist.

In the second part of this subsection, we shall consider the question of whether
there are structural assumptions about an exchangeable sequence $x_1, x_2, \ldots$ which
imply that the mixing must be over exponential family forms. 

Previously, in Section 4.4, we considered particular invariance assumptions, 
which, together with exchangeability, identified the parametric forms that had to 
appear in the mixture representation. Here, we shall consider, instead, whether
characterisations can be established via assumptions about conditional distribu- 
tions, motivated by sufficiency ideas. 

As a preliminary, suppose for a moment that an exchangeable sequence, $\{y_j\}$,
is modelled by

$$p(y_1, \ldots, y_n) = \int \prod_{j=1}^{n} \mathrm{Cef}(y_j \mid a, b, \psi) \, dQ(\psi).$$

Now consider the form of $p(y_1, \ldots, y_k \mid y_1 + \cdots + y_n = s)$, $k < n$. Because
of exchangeability, this has a representation as a mixture over

$$p(y_1, \ldots, y_k \mid \psi, y_1 + \cdots + y_n = s).$$

But the latter does not involve $\psi$ because of the sufficiency of $y_1 + \cdots + y_n$
(Propositions 4.11 and 4.15), so that

$$p\left(y_1, \ldots, y_k \,\Big|\, \sum_{j=1}^{n} y_j = s\right) = p\left(y_1, \ldots, y_k \,\Big|\, \sum_{j=1}^{n} y_j = s, \psi\right)$$

$$= \frac{\left[ \prod_{j=1}^{k} a(y_j) \exp\{ \psi^t y_j - b(\psi) \} \right] a^{(n-k)}(s - s_k) \exp\{ \psi^t (s - s_k) - (n - k) b(\psi) \}}{a^{(n)}(s) \exp\{ \psi^t s - n b(\psi) \}},$$

where, in the numerator, $s_k = y_1 + \cdots + y_k < s$. The exponential family mixture
representation thus implies that

$$p\left(y_1, \ldots, y_k \,\Big|\, \sum_{j=1}^{n} y_j = s\right) = \frac{\prod_{j=1}^{k} a(y_j) \, a^{(n-k)}(s - s_k)}{a^{(n)}(s)}.$$

Now suppose we consider the converse. If we assume $y_1, y_2, \ldots$ to be
exchangeable and also assume that, for all $n$ and $k < n$, the conditional distributions
have the above form (for some $a$ defining a $\mathrm{Cef}(y \mid a, b, \psi)$ form), does this
imply that $p(y_1, \ldots, y_n)$ has the corresponding exponential family mixture form?
A rigorous mathematical discussion of this question is beyond the scope of this 
volume (see Diaconis and Freedman, 1990). However, with considerable licence 
in ignoring regularity conditions, the result and the “flavour” of a proof are given 
by the following.


Proposition 4.16. (Representation theorem under sufficiency).
If $y_1, y_2, \ldots$ is any exchangeable sequence such that, for all $n \ge 2$ and $k < n$,

$$p(y_1, \ldots, y_k \mid y_1 + \cdots + y_n = s) = \prod_{j=1}^{k} a(y_j) \, a^{(n-k)}(s - s_k) / a^{(n)}(s),$$

where $s_k = y_1 + \cdots + y_k$ and $a(\cdot)$ defines $\mathrm{Cef}(y \mid a, b, \psi)$, then

$$p(y_1, \ldots, y_n) = \int \prod_{i=1}^{n} \mathrm{Cef}(y_i \mid a, b, \psi) \, dQ(\psi),$$

for some $dQ(\psi)$.


Outline proof. We first note that exchangeability implies a mixture
representation, mixing over distributions which make the $y_i$ independent. But each of the
latter distributions, with densities denoted generically by $f$, itself implies an
exchangeable sequence, so that, for $n \ge 2$, $k < n$, $f(y_1, \ldots, y_k \mid y_1 + \cdots + y_n = s)$
also has the specified form in terms of $a(\cdot)$.

Now consider $n = 2$, $k = 1$. Independence implies that

$$f(y_1 \mid y_1 + y_2 = s) = \frac{f(y_1) f(s - y_1)}{f^{(2)}(s)},$$

where $f(\cdot)$ denotes the marginal density and $f^{(2)}(\cdot)$ its twofold convolution, so that
$f(\cdot)$ must satisfy

$$\frac{f(y_1) f(s - y_1)}{f^{(2)}(s)} = \frac{a(y_1) a(s - y_1)}{a^{(2)}(s)}.$$

If we now define

$$u(y_1) = \log \frac{f(y_1)}{a(y_1)} - \log \frac{f(0)}{a(0)}, \qquad
v(s) = \log \frac{f^{(2)}(s)}{a^{(2)}(s)} - 2 \log \frac{f(0)}{a(0)},$$

it follows that

$$u(y_1) + u(s - y_1) = v(s).$$

Setting $y_1 = s$, and noting that $u(0) = 0$, we obtain $u(s) = v(s)$, and hence

$$u(y_1) + u(y_2) = u(y_1 + y_2).$$

This implies that $u(y) = \psi^t y$, for some $\psi$, so that

$$f(y) = a(y) \exp\{ \psi^t y - b(\psi) \}.$$


The following example provides a concrete illustration of the general result.


Example 4.8. (Characterisation of the Poisson model). Suppose that the sequence
of non-negative integer valued random quantities $y_1, y_2, \ldots$ is judged exchangeable, with
the conditional distribution of $y = (y_1, \ldots, y_k)$ given $y_1 + \cdots + y_n = s$, for all $n \ge 2$, $k < n$,
specified to be the multinomial $\mathrm{Mu}_k(y \mid s, \theta)$, where $\theta = (1/n, \ldots, 1/n)$, so that

$$p(y_1, \ldots, y_k \mid y_1 + \cdots + y_n = s) = \frac{s!}{\prod_{j=1}^{k} y_j! \, (s - s_k)!} \left( \frac{1}{n} \right)^{s_k} \left( 1 - \frac{k}{n} \right)^{s - s_k},$$

where $s_k = y_1 + \cdots + y_k$. Noting that the Poisson distribution, $\mathrm{Po}(y \mid e^{\psi})$, can be written in
$\mathrm{Cef}(y \mid a, b, \psi)$ form as

$$\mathrm{Po}(y \mid e^{\psi}) = \frac{1}{y!} \exp\{ y\psi - e^{\psi} \} = a(y) \exp\{ y\psi - b(\psi) \},$$

from which it easily follows that $a^{(n)}(s) = n^s / s!$, it is straightforward to check that, in terms
of $a(\cdot)$ and $a^{(n)}(\cdot)$,

$$\mathrm{Mu}_k(y \mid s, \theta) = \prod_{j=1}^{k} a(y_j) \, a^{(n-k)}(s - s_k) / a^{(n)}(s).$$

By Proposition 4.16, it follows that the belief specification for $y_1, y_2, \ldots$ is coherent and
implies that

$$p(y_1, \ldots, y_n) = \int \prod_{i=1}^{n} \mathrm{Po}(y_i \mid e^{\psi}) \, dQ(\psi),$$

for some $dQ(\psi)$, $\psi \in \Re$.
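The identity used in Example 4.8 can be checked for particular values. The sketch below evaluates the conditional probability in three ways: from the multinomial formula, from $a(\cdot)$ and $a^{(n)}(\cdot)$, and directly from independent Poisson observations (relying on the standard fact that a sum of independent Poisson quantities is Poisson). The specific values of $n$, $k$, $y$ and the Poisson rates are illustrative choices.

```python
import math


def a_n(s: int, n: int) -> float:
    """n-fold convolution of a(y) = 1/y!, namely n**s / s!."""
    return n ** s / math.factorial(s)


def conditional_via_a(ys, s, n):
    """prod_j a(y_j) * a^(n-k)(s - s_k) / a^(n)(s)."""
    k, s_k = len(ys), sum(ys)
    value = a_n(s - s_k, n - k)
    for y in ys:
        value *= a_n(y, 1)
    return value / a_n(s, n)


def multinomial(ys, s, n):
    """Mu_k(y | s, theta) with theta = (1/n, ..., 1/n)."""
    k, s_k = len(ys), sum(ys)
    coef = math.factorial(s) / (
        math.prod(math.factorial(y) for y in ys) * math.factorial(s - s_k))
    return coef * (1.0 / n) ** s_k * (1.0 - k / n) ** (s - s_k)


def poisson_pmf(y: int, rate: float) -> float:
    return rate ** y * math.exp(-rate) / math.factorial(y)


def conditional_via_poisson(ys, s, n, rate):
    """P(y_1..y_k | total = s) for n independent Poisson(rate) quantities."""
    k, s_k = len(ys), sum(ys)
    joint = poisson_pmf(s - s_k, (n - k) * rate)
    for y in ys:
        joint *= poisson_pmf(y, rate)
    return joint / poisson_pmf(s, n * rate)


ys, s, n = (1, 2), 6, 5
values = [multinomial(ys, s, n), conditional_via_a(ys, s, n),
          conditional_via_poisson(ys, s, n, 0.4),
          conditional_via_poisson(ys, s, n, 3.0)]
print(values)
```

The agreement of the last two values illustrates that the conditional distribution given the total does not involve the Poisson parameter.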


As we remarked earlier, the above heuristic analysis and discussion for the
k-parameter regular exponential family has been given without any attempt at rigour.
For the full story the reader is referred to Diaconis and Freedman (1990). Other 
relevant references for the mathematics of exponential families include Barndorff- 
Nielsen (1978), Morris (1982) and Brown (1985). 

We conclude this subsection by considering, briefly and informally, what can 
be said about characterisations of exchangeable sequences as mixtures of non- 
regular exponential families. For concreteness, we shall focus on the uniform, 
$\mathrm{U}(x \mid 0, \theta)$, distribution, which has density $\theta^{-1} I_{(0, \theta)}(x)$, $x \in \Re$, and sufficient
statistic $\max\{x_1, \ldots, x_n\}$, given a sample $x_1, \ldots, x_n$. This sufficient statistic is clearly
not a summation, as is the case for regular families (and plays a key role in
Proposition 4.16). However, conditional on $m_n = \max\{x_1, \ldots, x_n\}$, $x_1, \ldots, x_k$, $k < n$,
are approximately independent $\mathrm{U}(x_i \mid 0, m_n)$ and this will therefore be true for
all exchangeable $x_1, x_2, \ldots$ constructed by mixing over independent $\mathrm{U}(x_i \mid 0, \theta)$.
Conversely, we might wonder whether positive exchangeable sequences having
this conditional property are necessarily mixtures of independent $\mathrm{U}(x_i \mid 0, \theta)$.
Intuitively, if $m_n$ tends to a finite $\theta$ from below, as $n \to \infty$, one might expect the
result to be true. This is indeed the case, but a general account of the required math- 
ematical results is beyond our intended scope in this volume. The interested reader 
is referred to Diaconis and Freedman (1984), and the further references discussed 
in Section 4.8.1. 


4.5.4 Information Measures and the Exponential Family 


Our approach to the exponential family has been through the concept of predictive 
or, equivalently, parametric sufficient statistics. It is interesting to note, however, 
that exponential family distributions can also be motivated through the concept of 
the utility of a distribution (cf. Section 3.4), using the derived notions of
approximation and discrepancy.

Consider the following problem. We seek to obtain a mathematical
representation of a probability density $p(x)$, which satisfies the $k$ (independent) constraints

$$\int_X h_i(x) \, p(x) \, dx = m_i < \infty, \qquad i = 1, \ldots, k,$$

where $m_1, \ldots, m_k$ are specified constants, together with the normalizing constraint
$\int_X p(x) \, dx = 1$, and, in addition, is to be approximated as closely as possible by a
specified density $f(x)$.

We recall from Definition 3.20 (with a convenient change of notation) that the
discrepancy from a probability density $p(x)$ assumed to be true of an approximation
$f(x)$ is given by

$$\delta(f \mid p) = \int_X p(x) \log \frac{p(x)}{f(x)} \, dx,$$

where $f$ and $p$ are both assumed to be strictly positive densities over the same range,
$X$, of possible values. Note that we are interested in deriving a mathematical
representation of the true probability density $p(x)$, not of the (specified) approximation
$f(x)$. Thus, we minimise $\delta(f \mid p)$ over $p$ subject to the required constraints on $p$,
rather than $\delta(f \mid p)$ over $f$ subject to constraints on $f$. Hence, we seek $p$ to minimise

$$F(p) = \int_X p(x) \log \frac{p(x)}{f(x)} \, dx
+ \sum_{i=1}^{k} \theta_i \left[ \int_X h_i(x) \, p(x) \, dx - m_i \right]
+ c \left[ \int_X p(x) \, dx - 1 \right],$$


where @,,..., 6, and ¢ are arbitrary constant multipliers. 
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Proposition 4.17. (The exponential family as an approximation). 
The functional F (p) defined above is minimised by 


p(v) = Efp(ar| fig. hid. @.c).7 eX 


where f and hare given in F(p).c, = 1.¢=60 = (@..... 8;.) and 


l k 
q(0) = 7 f(r)exp {rama} dr. 


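Proof. By a standard variational argument (see, for example, Jeffreys and
Jeffreys, 1946, Chapter 10), a necessary condition for $p$ to give a stationary value
of $F(p)$ is that

$$\frac{\partial}{\partial \alpha} F\bigl(p(x) + \alpha \tau(x)\bigr) \Big|_{\alpha = 0} = 0$$

for any function $\tau : X \to \Re$ of sufficiently small norm. This condition reduces to
the equation

$$\int_X \left\{ \log \frac{p(x)}{f(x)} + 1 + \sum_{i=1}^{k} \theta_i h_i(x) + c \right\} \tau(x)\,dx = 0,$$

from which it follows, since the multipliers $\theta_i$ are arbitrary and their signs may be
absorbed by relabelling, that

$$p(x) \propto f(x) \exp\left\{ \sum_{i=1}^{k} \theta_i h_i(x) \right\},$$

as required. (For an alternative proof, see Kullback, 1959/1968, Chapter 3.)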


The resulting exponential family form for $p(x)$ was derived on the basis of a
given approximation $f(x)$ and a collection of “constraint” functions
$h(x) = [h_1(x), \ldots, h_k(x)]$. If we wish to emphasise this derivation of the family,
we shall refer to $\mathrm{Ef}(x \mid f, g, h, \phi, \theta, c)$ as the exponential family
generated by $f$ and $h$.

In general, specification of the sufficient statistic

$$t_n = \left[ \sum_{i=1}^{n} h_1(x_i), \ldots, \sum_{i=1}^{n} h_k(x_i) \right]$$

does not uniquely identify the form of $f(x)$ within the exponential family framework.
Consider, for example, the $\mathrm{Ga}(x \mid \alpha, \theta)$ family with $\alpha$ known.
Each distinct $\alpha$ defines a distinct exponential family with density

$$\frac{x^{\alpha - 1}}{\Gamma(\alpha)}\,\theta^{\alpha} \exp\{-\theta x\},$$

so that, in addition to $h(x) = x$, we need to specify $f(x) = x^{\alpha-1}/\Gamma(\alpha)$ in
order to identify the family.

Returning to the general problem of choosing $p$ to be “as close as possible” to
an “approximation” $f$, subject to the $k$ constraints defined by $h(x)$, it is interesting
to ask what happens if the approximation $f$ is very “vague”, in the sense that $f$ is
extremely diffusely spread over $X$. A limiting form of this would be to consider
$f(x) = \text{constant}$, which leads us to seek the $p$ minimising $\int_X p(x) \log p(x)\,dx$
subject to the given constraints. The solution is then

$$p(x) = \frac{\exp\left\{ \sum_{i=1}^{k} \theta_i h_i(x) \right\}}
             {\int_X \exp\left\{ \sum_{i=1}^{k} \theta_i h_i(x) \right\} dx},$$

which, since minimising $\int_X p(x) \log p(x)\,dx$ is equivalent to maximising
$H(p) = -\int_X p(x) \log p(x)\,dx$, is the so-called maximum entropy choice of $p$.

Thus, for example, if $X = \Re^+$ and $h(x) = x$, the maximum entropy choice for
$p(x)$ is $\mathrm{Ex}(x \mid \theta)$, the exponential distribution with
$\theta^{-1} = E(x \mid \theta)$. If $X = \Re$ and $h(x) = (x, x^2)$, the maximum entropy
choice for $p(x)$ turns out to be $N(x \mid \mu, \lambda)$, the normal distribution with
$\mu = E(x \mid \mu, \lambda)$, $\lambda^{-1} = V(x \mid \mu, \lambda)$ (cf. Example 3.4,
following Definition 3.20).
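
As a rough numerical illustration of the result above, the following sketch replaces the continuous range $X$ by a small finite grid (an assumption made purely for illustration) and takes a uniform, "vague" $f$. The names `tilt`, `solve_theta` and `discrepancy` are hypothetical helpers, not notation from the text. The exponentially tilted form $p_i \propto f_i \exp(\theta h_i)$ is fitted to a single mean constraint by bisection and compared with another distribution satisfying the same constraint.

```python
import math

def tilt(f, h, theta):
    """Return the pmf p_i proportional to f_i * exp(theta * h_i)."""
    weights = [fi * math.exp(theta * hi) for fi, hi in zip(f, h)]
    total = sum(weights)
    return [w / total for w in weights]

def expectation(p, h):
    """Expected value of h under the pmf p."""
    return sum(pi * hi for pi, hi in zip(p, h))

def solve_theta(f, h, m, lo=-30.0, hi=30.0, iters=200):
    """Bisection for theta; the tilted mean of h is increasing in theta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expectation(tilt(f, h, mid), h) < m:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrepancy(p, f):
    """delta(f | p) = sum_i p_i log(p_i / f_i), with 0 log 0 taken as 0."""
    return sum(pi * math.log(pi / fi) for pi, fi in zip(p, f) if pi > 0)

grid = [0, 1, 2, 3, 4, 5]                 # hypothetical finite sample space
f = [1 / 6] * 6                           # uniform ("vague") approximation
theta = solve_theta(f, grid, 1.5)         # constraint E[h(x)] = 1.5 with h(x) = x
p_star = tilt(f, grid, theta)             # exponential-family solution
p_alt = [0.5, 0.0, 0.0, 0.5, 0.0, 0.0]    # another pmf with the same mean

print(discrepancy(p_star, f), discrepancy(p_alt, f))
```

Under these assumptions the tilted solution has the smaller discrepancy from the uniform $f$, which is the discrete analogue of the maximum entropy property.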

Our discussion of modelling has so far concentrated on the case of beliefs
about a single sequence of observations $x_1, x_2, \ldots$, judged to have various kinds
of invariance or sufficiency properties. In the next section, we shall extend our
discussion in order to relate these ideas to the more complex situations which arise
when several such sequences of observations are involved, or when there are several
possible ways of making exchangeable or related judgements about sequences.


4.6 MODELS VIA PARTIAL EXCHANGEABILITY 
4.6.1 Models for Extended Data Structures 


In Section 4.5, we discussed various kinds of justification for modelling a sequence
of random quantities $x_1, x_2, \ldots$ as a random sample from a parametric family
with density $p(x \mid \theta)$, together with a prior distribution $dQ(\theta)$ for $\theta$.
We also briefly mentioned further possible kinds of judgements, involving assumptions
about conditional moments or information considerations, which further help to
pinpoint the appropriate specification of a parametric family.

However, in order to concentrate on the basic conceptual issues, we have
thus far restricted attention to the case of a single sequence of random quantities,
$x_1, x_2, \ldots$, labelled by a single index, $i = 1, 2, \ldots$, and unrelated to other random
quantities. Clearly, in many (if not most) areas of application of statistical modelling
the situation will be more complicated than this, and we shall need to extend and
adapt the basic form of representation to deal with the perceived complexities of
the situation. Among the typical (but by no means exhaustive) kinds of situation
we shall wish to consider are the following.
(i) Sequences $x_{i1}, x_{i2}, \ldots$ of random quantities are to be observed in each of
$i \in I$ contexts. For example: we may have sequences of clinical responses to each of
$I$ different drugs; or responses to the same drug used on $I$ different subgroups
of a population. A modelling framework is required which enables us to learn,
in some sense, about differences between some aspect of the responses in the
different sequences.


(ii) In each of $i \in I$ contexts, $j \in J$ different treatments are each replicated
$k \in K$ times, and the random quantities $x_{ijk}$ denote observable responses for
each context/treatment/replicate combination. For example: we may have $I$
different irrigation systems for fruit trees, $J$ different tree pruning regimes and
$K$ trees exposed to each irrigation/pruning combination, with $x_{ijk}$ denoting
the total yield of fruit in a given year; or we may have $I$ different geographical
areas, $J$ different age-groups and $K$ individuals in each of the $IJ$ combinations,
with $x_{ijk}$ denoting the presence or absence of a specific type of disease,
or a coding of voting intention, or whatever. A modelling framework is required
which enables us to investigate differences between either contexts, or
treatments, or context/treatment combinations.

(iii) Sequences of random quantities $x_{i1}, x_{i2}, \ldots$, $i \in I$, are to be observed, where
some form of qualitative assumption has been made about a form of relationship
between the $x_{ij}$ and other specified (controlled or observed) quantities
$z_i = (z_{i1}, \ldots, z_{ik})$, $k \geq 1$. For example: $x_{ij}$ might denote the status (dead or
alive) of the $j$th rat exposed to a toxic substance administered at dose level $z_i$,
with an assumed form of relationship between $z_i$ and the corresponding “death
rate”; or $x_{ij}$ might denote the height or weight at time $z_i$ from the $j$th replicate
measurement of a plant or animal following some assumed form of “growth
curve”; or $x_{ij}$ might denote the output yield on the $j$th run of a chemical
process when $k$ inputs are set at the levels $z_i = (z_{i1}, \ldots, z_{ik})$ and the general
form of relationship between process output and inputs is either assumed
known or well-approximated by a specified mathematical form. In each case
a modelling framework is required which enables us to learn about the quantitative
form of the relationship, and to quantify beliefs (predictions) about the
observable $x^*$ corresponding to a specified input or control quantity $z^*$.

(iv) Exchangeable sequences, $x_{i1}, x_{i2}, \ldots$, of random quantities are to be observed
in each of $i \in I$ contexts, where $I$ is itself a selection from a potentially larger
index set $I^*$. Suppose that for each sequence,

$$t_n^{(i)} = t_n(x_{i1}, \ldots, x_{in}), \qquad i \in I,$$

is judged to be a sufficient statistic, that the strong law limits

$$\theta_i = \lim_{n \to \infty} \frac{1}{n}\, t_n^{(i)}, \qquad i \in I,$$

exist and that the sequence $\theta_1, \theta_2, \ldots$ is itself judged exchangeable. For example:
sequence $i$ may consist of 0-1 (success-failure) outcomes on repeated
trials with the $i$th of $I$ similar electronic components; or sequence $i$ may consist
of quality measurements of known precision on replicate samples of the
$i$th of $I$ chemically similar dyestuffs. In the first case, the sequence of long-run
frequencies of failures for each of the components might, a priori, be judged to
be exchangeable; in the second case, the sequence of large-sample averages of
quality for each of the dyestuffs might, a priori, be judged to be exchangeable.
A modelling framework is required which enables us to exploit such further
judgements of exchangeability in order to be able to use information from all
the sequences to strengthen, in some sense, the learning process within an
individual sequence.


4.6.2 Several Samples 


We shall begin our discussion of possible forms of partial exchangeability judgements
for several sequences of observables, $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, by considering
the simple case of 0-1 random quantities.

In many situations, including that of a comparative clinical trial, joint beliefs
about several sequences of 0-1 observables would typically have the property
encapsulated in the following definition, where, here and throughout this section,
$x_i(n_i)$ denotes the vector of random quantities $(x_{i1}, \ldots, x_{in_i})$.


Definition 4.13. (Unrestricted exchangeability for 0-1 sequences).
Sequences of 0-1 random quantities, $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, are said
to be unrestrictedly exchangeable if each sequence is infinitely exchangeable
and, in addition, for all $n_i \leq N_i$, $i = 1, \ldots, m$,

$$p(x_1(n_1), \ldots, x_m(n_m) \mid y_1(N_1), \ldots, y_m(N_m)) = \prod_{i=1}^{m} p(x_i(n_i) \mid y_i(N_i)),$$

where $y_i(N_i) = x_{i1} + \cdots + x_{iN_i}$, $i = 1, \ldots, m$.


In addition to the exchangeability of the individual sequences, this definition
encapsulates the judgement that, given the total number of successes in the first
$N_i$ observations from the $i$th sequence, $i = 1, \ldots, m$, only the total for the $i$th
sequence is relevant when it comes to beliefs about the outcomes of any subset
of $n_i$ of the $N_i$ observations from that sequence. Thus, for example, given 15
deaths in the first 100 patients receiving Drug 1 ($N_1 = 100$, $y_1(N_1) = 15$) and
20 deaths in the first 80 patients receiving Drug 2 ($N_2 = 80$, $y_2(N_2) = 20$), we
would typically judge the latter information to be irrelevant to any assessment of the
probability that the first three patients receiving Drug 1 survived and the fourth one
died ($x_{11} = 0$, $x_{12} = 0$, $x_{13} = 0$, $x_{14} = 1$). Of course, the information might well
be judged relevant if we were not informed of the value of $y_1(N_1)$. The definition
thus encapsulates a kind of “conditional irrelevance” judgement.
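
The Drug 1 calculation just described can be sketched numerically. The function below is a hypothetical helper (its name and signature are not from the text); it relies only on the standard fact that, for an exchangeable 0-1 sequence, all arrangements of $y$ ones among the first $N$ outcomes are equally likely given the total. Under the conditional irrelevance judgement, the Drug 2 data do not enter at all.

```python
from fractions import Fraction
from math import comb

def prob_sequence_given_total(x, N, y):
    """P(first len(x) outcomes equal x | y ones among the first N outcomes)
    for an exchangeable 0-1 sequence: all C(N, y) arrangements are
    equally likely, and C(N - n, y - s) of them start with x."""
    n, s = len(x), sum(x)
    if s > y or n - s > N - y:
        return Fraction(0)
    return Fraction(comb(N - n, y - s), comb(N, y))

# First three Drug 1 patients survive (0) and the fourth dies (1),
# given 15 deaths among the first 100 patients.
p = prob_sequence_given_total([0, 0, 0, 1], N=100, y=15)
print(p, float(p))
```

The result equals $(85/100)(84/99)(83/98)(15/97)$, roughly 0.094, as one would obtain by drawing patients without replacement.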


As an example of a situation where this condition does not apply, suppose that
$x_{11}, x_{12}, \ldots$ is an infinitely exchangeable 0-1 sequence and that another sequence
$x_{21}, x_{22}, \ldots$ is defined by $x_{2j} = x_{1j}$ (or by $x_{2j} = 1 - x_{1j}$). Then
$x_{21}, x_{22}, \ldots$ is certainly an exchangeable sequence (since $x_{11}, x_{12}, \ldots$ is), but,
taking $x_{2j} = x_{1j}$ and $n_1 = n_2 = N_1 = N_2 = 2$,

$$p(x_{11} = 0, x_{12} = 1, x_{21} = 1, x_{22} = 0 \mid y_1 = 1, y_2 = 1) = 0,$$

whereas

$$p(x_{11} = 0, x_{12} = 1 \mid y_1 = 1)\, p(x_{21} = 1, x_{22} = 0 \mid y_2 = 1) = 1/2 \times 1/2 = 1/4.$$


Further insight is obtained by noting (from Definition 4.13) that unrestricted
exchangeability implies that

$$p(x_{11}, \ldots, x_{1n_1}, \ldots, x_{m1}, \ldots, x_{mn_m})
  = p(x_{1\pi_1(1)}, \ldots, x_{1\pi_1(n_1)}, \ldots, x_{m\pi_m(1)}, \ldots, x_{m\pi_m(n_m)})$$

for any unrestricted choice of permutations $\pi_i$ of $\{1, \ldots, n_i\}$, $i = 1, \ldots, m$, whereas,
in the case of the above counter-example, we only have invariance of the joint
distribution when $\pi_i = \pi$ for every $i$. For a development starting from this latter
condition see de Finetti (1938).
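
The counter-example can be checked by direct enumeration. In the sketch below the first sequence is taken, purely as an illustrative assumption, to consist of two independent fair coin tosses, and the second sequence copies the first; the helper names `prob` and `cond` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumed for illustration: x11, x12 are independent fair coin tosses,
# and the second sequence copies the first (x2j = x1j).
joint = {}
for x11, x12 in product((0, 1), repeat=2):
    joint[(x11, x12, x11, x12)] = Fraction(1, 4)

def prob(event):
    """Probability of the set of outcomes (x11, x12, x21, x22) satisfying event."""
    return sum(p for outcome, p in joint.items() if event(outcome))

def cond(event, given):
    """Conditional probability P(event | given)."""
    return prob(lambda o: event(o) and given(o)) / prob(given)

def both_totals_one(o):
    return o[0] + o[1] == 1 and o[2] + o[3] == 1

# Joint conditional probability versus the product required by Definition 4.13.
lhs = cond(lambda o: o == (0, 1, 1, 0), both_totals_one)
rhs = (cond(lambda o: o[:2] == (0, 1), lambda o: o[0] + o[1] == 1)
       * cond(lambda o: o[2:] == (1, 0), lambda o: o[2] + o[3] == 1))
print(lhs, rhs)
```

The two quantities differ (0 against 1/4), so the copied sequences are exchangeable individually but not unrestrictedly exchangeable.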


We can now establish the following generalisation of Proposition 4.1.


Proposition 4.18. (Representation theorem for several sequences of 0-1
random quantities). If $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, are unrestrictedly infinitely
exchangeable sequences of 0-1 random quantities with joint probability measure
$P$, there exists a distribution function $Q$ such that

$$p(x_1(n_1), \ldots, x_m(n_m)) = \int_{[0,1]^m} \prod_{i=1}^{m} \prod_{j=1}^{n_i}
  \theta_i^{x_{ij}} (1 - \theta_i)^{1 - x_{ij}}\, dQ(\theta),$$

where, with $y_i(n_i) = x_{i1} + \cdots + x_{in_i}$, $i = 1, \ldots, m$,

$$Q(\theta) = \lim_{\text{all } n_i \to \infty} P\left[ \left( \frac{y_1(n_1)}{n_1} \leq \theta_1 \right)
  \cap \cdots \cap \left( \frac{y_m(n_m)}{n_m} \leq \theta_m \right) \right].$$


Corollary. Under the conditions of Proposition 4.18,

$$p(y_1(n_1), \ldots, y_m(n_m)) = \int_{[0,1]^m} \prod_{i=1}^{m} \binom{n_i}{y_i(n_i)}
  \theta_i^{y_i(n_i)} (1 - \theta_i)^{n_i - y_i(n_i)}\, dQ(\theta_1, \ldots, \theta_m).$$


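Proof. We first note that

$$p(y_1(n_1), \ldots, y_m(n_m)) = \binom{n_1}{y_1(n_1)} \cdots \binom{n_m}{y_m(n_m)}\,
  p(x_1(n_1), \ldots, x_m(n_m)),$$

so that, to prove the proposition, it suffices to establish the corollary. Moreover,
for any $N_i \geq n_i$, $i = 1, \ldots, m$, we may express $p(y_1(n_1), \ldots, y_m(n_m))$ as

$$\sum p(y_1(n_1), \ldots, y_m(n_m) \mid y_1(N_1), \ldots, y_m(N_m))\, p(y_1(N_1), \ldots, y_m(N_m)),$$

where the $i$th of the $m$ summations ranges from $y_i(N_i) = y_i(n_i)$ to $y_i(N_i) = N_i$
and where, by Definition 4.4 and a straightforward generalisation of the argument
given in Proposition 4.1,

$$p(y_1(n_1), \ldots, y_m(n_m) \mid y_1(N_1), \ldots, y_m(N_m))
  = \prod_{i=1}^{m} p(y_i(n_i) \mid y_i(N_i))
  = \prod_{i=1}^{m} \binom{y_i(N_i)}{y_i(n_i)} \binom{N_i - y_i(N_i)}{n_i - y_i(n_i)} \Big/ \binom{N_i}{n_i}.$$

Writing $(y_N)_{y_n} = y_N (y_N - 1) \cdots (y_N - (y_n - 1))$, etc., and defining the
function $Q_{N_1, \ldots, N_m}(\theta_1, \ldots, \theta_m)$ on $\Re^m$ to be the $m$-dimensional “step”
function with “jumps” of $p(y_1(N_1), \ldots, y_m(N_m))$ at

$$(\theta_1, \ldots, \theta_m) = \left( \frac{y_1(N_1)}{N_1}, \ldots, \frac{y_m(N_m)}{N_m} \right),$$

where $y_i(N_i) = 0, \ldots, N_i$, $i = 1, \ldots, m$, we see that $p(y_1(n_1), \ldots, y_m(n_m))$ is
equal to

$$\int_{[0,1]^m} \prod_{i=1}^{m} \binom{n_i}{y_i(n_i)}
  \frac{(\theta_i N_i)_{y_i(n_i)}\, ((1 - \theta_i) N_i)_{n_i - y_i(n_i)}}{(N_i)_{n_i}}\,
  dQ_{N_1, \ldots, N_m}(\theta_1, \ldots, \theta_m).$$

As $N_1, \ldots, N_m \to \infty$,

$$\prod_{i=1}^{m} \binom{n_i}{y_i(n_i)}
  \frac{(\theta_i N_i)_{y_i(n_i)}\, ((1 - \theta_i) N_i)_{n_i - y_i(n_i)}}{(N_i)_{n_i}}
  \to \prod_{i=1}^{m} \binom{n_i}{y_i(n_i)} \theta_i^{y_i(n_i)} (1 - \theta_i)^{n_i - y_i(n_i)},$$

uniformly in $\theta_1, \ldots, \theta_m$ and, by the multidimensional version of Helly's theorem
(see Section 3.2.3), there exists a subsequence $Q_{N_1(j), \ldots, N_m(j)}$, $j = 1, 2, \ldots$, having
a limit $Q$, which is a distribution function on $\Re^m$. The result follows.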
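
The limiting step in the proof can be examined numerically for a single sequence. The sketch below is illustrative only: the function names are hypothetical and the values $\theta = 0.3$, $n = 5$, $y = 2$ are arbitrary choices. It compares the finite-$N$ term, written with falling factorials, with its binomial limit.

```python
from math import comb

def falling(a, r):
    """Falling factorial (a)_r = a (a - 1) ... (a - r + 1)."""
    out = 1.0
    for i in range(r):
        out *= a - i
    return out

def finite_N_term(theta, n, y, N):
    """C(n, y) (theta N)_y ((1 - theta) N)_{n - y} / (N)_n."""
    return (comb(n, y) * falling(theta * N, y)
            * falling((1 - theta) * N, n - y) / falling(N, n))

def binomial_term(theta, n, y):
    """C(n, y) theta^y (1 - theta)^(n - y)."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Absolute error of the finite-N term for increasing N.
errors = [abs(finite_N_term(0.3, 5, 2, N) - binomial_term(0.3, 5, 2))
          for N in (10, 100, 1000, 10000)]
print(errors)
```

For these values the error shrinks steadily as $N$ grows, in line with the convergence used in the proof.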

Considering, for simplicity, $m = 2$, Proposition 4.18 (or its corollary) asserts
that if we judge two sequences of 0-1 random quantities to be unrestrictedly
exchangeable, we can proceed as if:


(i) the $x_{ij}$ are judged to be independent Bernoulli random quantities (or the $y_i(n_i)$
to be independent binomial random quantities) conditional on random quantities
$\theta_i$, $i = 1, 2$;

(ii) $(\theta_1, \theta_2)$ are assigned a joint probability distribution $Q$;

(iii) by the strong law of large numbers, $\theta_i = \lim_{n_i \to \infty} (y_i(n_i)/n_i)$, so that $Q$ may
be interpreted as “joint beliefs about the limiting relative frequencies of 1's in
the two sequences”.


The model is completed by the specification of $dQ(\theta_1, \theta_2)$, whose detailed
form will, of course, depend on the particular beliefs appropriate to the actual
practical application of the model. At a qualitative level, we note the following
possibilities:

(a) knowledge of the limiting relative frequency for one of the sequences would
not change beliefs about outcomes in the other sequence, so that we have the
independent form of prior specification, $dQ(\theta_1, \theta_2) = dQ(\theta_1)\,dQ(\theta_2)$;

(b) the limiting relative frequency for the second sequence will necessarily be
greater than that for the first sequence (due, for example, to a known improvement
in a drug or an electronic component under test), so that $dQ(\theta_1, \theta_2)$ is
zero outside the range $0 \leq \theta_1 < \theta_2 \leq 1$;

(c) there is a real possibility, to which an individual assigns probability $\pi$, say,
that, in fact, the limiting frequencies could turn out to be equal, so that, writing
$\theta = \theta_1 = \theta_2$ in this case, $dQ(\theta_1, \theta_2)$ has the form

$$\pi\, dQ^*(\theta) + (1 - \pi)\, dQ^\dagger(\theta_1, \theta_2)$$

and the representation, for $(n_1, n_2)$, say, has the form

$$p(y_1(n_1), y_2(n_2)) = \pi \int_0^1 \mathrm{Bi}(y_1(n_1) \mid \theta, n_1)\, \mathrm{Bi}(y_2(n_2) \mid \theta, n_2)\, dQ^*(\theta)
  + (1 - \pi) \int_0^1 \!\! \int_0^1 \mathrm{Bi}(y_1(n_1) \mid \theta_1, n_1)\, \mathrm{Bi}(y_2(n_2) \mid \theta_2, n_2)\, dQ^\dagger(\theta_1, \theta_2),$$

where $dQ^\dagger(\theta_1, \theta_2)$ assigns probability over the range of values of $(\theta_1, \theta_2)$ such
that $\theta_1 \neq \theta_2$.
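
Possibility (c) can be made concrete under simple, assumed components: a uniform $Q^*$ on $[0, 1]$ and a uniform $Q^\dagger$ on the unit square (which gives probability one to $\theta_1 \neq \theta_2$). These choices, the function names and the prior weight used below are illustrative assumptions only. With them, both integrals have closed forms in terms of the beta function, and exact rational arithmetic can be used.

```python
from fractions import Fraction
from math import comb, factorial

def beta_int(a, b):
    """Beta function B(a, b) for positive integers a and b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def marginal_equal(y1, n1, y2, n2):
    """Integral of Bi(y1 | theta, n1) Bi(y2 | theta, n2) over a uniform Q*."""
    return (comb(n1, y1) * comb(n2, y2)
            * beta_int(y1 + y2 + 1, n1 + n2 - y1 - y2 + 1))

def marginal_unequal(y1, n1, y2, n2):
    """Double integral under independent uniform theta_1 and theta_2:
    each binomial term integrates to 1 / (n + 1), whatever the count."""
    return Fraction(1, (n1 + 1) * (n2 + 1))

def predictive(y1, n1, y2, n2, pi):
    """Mixture representation p(y1(n1), y2(n2)) with weight pi on equality."""
    return (pi * marginal_equal(y1, n1, y2, n2)
            + (1 - pi) * marginal_unequal(y1, n1, y2, n2))

def posterior_prob_equal(y1, n1, y2, n2, pi):
    """Probability of theta_1 = theta_2 given the data, by Bayes' theorem."""
    return pi * marginal_equal(y1, n1, y2, n2) / predictive(y1, n1, y2, n2, pi)

half = Fraction(1, 2)   # hypothetical prior weight on theta_1 = theta_2
print(float(posterior_prob_equal(15, 100, 20, 80, half)))
```

The last function is a by-product of the representation rather than something derived in the text; it simply shows how the mixture form lets the data revise the weight $\pi$.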

As we shall see later, in Chapter 5, the general form of representation of
beliefs for observables defined in terms of the two sequences, together with detailed
specifications of $dQ(\theta_1, \theta_2)$, enables us to explore coherently any desired aspect of
the learning process. For example, we may have observed that out of the first $n_1$, $n_2$
patients receiving drug treatments 1, 2, respectively, $y_1(n_1)$ and $y_2(n_2)$ survived,
and, on the basis of this information, wish to make judgements about the relative
performance of the drugs were they to be used on a large future sequence of patients.
This might be done by calculating, for example,

$$p\left( \lim_{N \to \infty} \frac{y_1(N)}{N} - \lim_{N \to \infty} \frac{y_2(N)}{N} \,\Big|\, y_1(n_1), y_2(n_2) \right),$$

which, in the language of the conventional paradigm, is the “posterior density for
$\theta_1 - \theta_2$, given $y_1(n_1)$, $y_2(n_2)$”.
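
A minimal simulation sketch of such a calculation follows. It assumes the independent form of prior, possibility (a), with uniform components, so that each $\theta_i$ has a beta posterior; the counts, the number of draws and the function name are hypothetical choices for illustration.

```python
import random

def posterior_difference_samples(y1, n1, y2, n2, draws=20000, seed=0):
    """Draws of theta_1 - theta_2 under independent uniform priors (an
    assumption), for which theta_i | data ~ Beta(y_i + 1, n_i - y_i + 1)."""
    rng = random.Random(seed)
    return [rng.betavariate(y1 + 1, n1 - y1 + 1)
            - rng.betavariate(y2 + 1, n2 - y2 + 1)
            for _ in range(draws)]

# Hypothetical data: 15 of 100 and 20 of 80 positive outcomes.
samples = posterior_difference_samples(15, 100, 20, 80)
mean_diff = sum(samples) / len(samples)
prob_first_smaller = sum(d < 0 for d in samples) / len(samples)
print(mean_diff, prob_first_smaller)
```

A histogram of `samples` would approximate the posterior density for $\theta_1 - \theta_2$; other priors would require a different sampling scheme.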

Clearly, the discussion and resulting forms of representation which we have
given for the case of unrestrictedly exchangeable sequences of 0-1 random quantities
can be extended to more general cases. One possible generalisation of Definition 4.13
is the following.


Definition 4.14. (Unrestricted exchangeability for sequences with predictive
sufficient statistics). Sequences of random quantities $x_{i1}, x_{i2}, \ldots$ taking values
in $X_i$, $i = 1, \ldots, m$, are said to be unrestrictedly infinitely exchangeable
if each sequence is infinitely exchangeable and, in addition, for all $n_i \leq N_i$,
$i = 1, \ldots, m$,

$$p(x_1(n_1), \ldots, x_m(n_m) \mid t_{N_1}, \ldots, t_{N_m}) = \prod_{i=1}^{m} p(x_i(n_i) \mid t_{N_i}),$$

where $t_{N_i} = t_{N_i}(x_i(N_i))$, $i = 1, \ldots, m$, are separately predictive sufficient
statistics for the individual sequences.


In general, given $m$ unrestrictedly exchangeable sequences of random quantities,
$x_{i1}, x_{i2}, \ldots$, with $x_{ij}$ taking values in $X_i$, we typically arrive at a representation
of the form

$$p(x_1(n_1), \ldots, x_m(n_m)) = \int_{\Theta^*} \prod_{i=1}^{m} \prod_{j=1}^{n_i}
  p_i(x_{ij} \mid \theta_i)\, dQ(\theta_1, \ldots, \theta_m),$$

where $\Theta^* = \prod_{i=1}^{m} \Theta_i$ and the parametric families

$$p_i(x \mid \theta_i), \qquad x \in X_i, \quad \theta_i \in \Theta_i, \quad i = 1, \ldots, m,$$

have been identified through consideration of sufficient statistics of fixed dimension,
or whatever, as discussed in previous sections. Most often, the fact that the
$m$ sequences are being considered together will mean that the random quantities
$x_{i1}, x_{i2}, \ldots$ relate to the same form of measurement or counting procedure for all
$i = 1, \ldots, m$, so that typically we will have $p_i(x \mid \theta_i) = p(x \mid \theta_i)$, $i = 1, \ldots, m$,
where the parameters correspond to strong law limits of functions of the sufficient
statistics. The following forms are frequently assumed in applications.




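Example 4.9. (Binomial). If $y_i(n_i)$ denotes the number of 1's in the first $n_i$ outcomes
of the $i$th of $m$ unrestrictedly exchangeable sequences of 0-1 random quantities, then,
by the corollary to Proposition 4.18,

$$p(y_1(n_1), \ldots, y_m(n_m)) = \int_{[0,1]^m} \prod_{i=1}^{m}
  \mathrm{Bi}(y_i(n_i) \mid \theta_i, n_i)\, dQ(\theta_1, \ldots, \theta_m).$$


Example 4.10. (Multinomial). If $\boldsymbol{y}_i(n_i)$ denotes the category membership count (into
the first $k$ of $k+1$ exclusive categories) from the first $n_i$ outcomes of the $i$th of $m$
unrestrictedly exchangeable sequences of “0-1 random vectors” (see Section 4.3), then

$$p(\boldsymbol{y}_1(n_1), \ldots, \boldsymbol{y}_m(n_m)) = \int_{\Theta^m} \prod_{i=1}^{m}
  \mathrm{Mu}_k(\boldsymbol{y}_i(n_i) \mid \boldsymbol{\theta}_i, n_i)\, dQ(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_m),$$

where $\boldsymbol{\theta}_i = \lim_{n_i \to \infty} (\boldsymbol{y}_i(n_i)/n_i)$ and
$\Theta = \{\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)$ such that $0 \leq \theta_l \leq 1$, $1 \leq l \leq k$,
and $\theta_1 + \cdots + \theta_k \leq 1\}$. This model describes beliefs about an $m \times (k+1)$
contingency table of count data, with row totals $n_1, \ldots, n_m$. It generalises the case
of the $m \times 2$ contingency table described in Example 4.9.


Example 4.11. (Normal). If $x_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, m$, denote real-valued
observations from $m$ unrestrictedly exchangeable sequences of real-valued random quantities,
the assumed sufficiency of the sample sum and sum of squares within each sequence might
lead to the representation

$$p(x_1(n_1), \ldots, x_m(n_m)) = \int_{\Theta} \prod_{i=1}^{m} \prod_{j=1}^{n_i}
  N(x_{ij} \mid \mu_i, \lambda_i)\, dQ(\theta),$$

where, with $\bar{x}_n(i) = n^{-1}(x_{i1} + \cdots + x_{in})$ and
$s_n^2(i) = n^{-1} \sum_{j=1}^{n} (x_{ij} - \bar{x}_n(i))^2$, we have
$\mu_i = \lim_{n \to \infty} \bar{x}_n(i)$, $\lambda_i^{-1} = \lim_{n \to \infty} s_n^2(i)$,
$\theta = (\mu_1, \ldots, \mu_m, \lambda_1, \ldots, \lambda_m)$ and $\Theta = \Re^m \times (\Re^+)^m$.

In many applications, the further judgement is made that $\lambda_1 = \cdots = \lambda_m = \lambda$, say, so
that the representation then takes the form

$$p(x_1(n_1), \ldots, x_m(n_m)) = \int_{\Re^m \times \Re^+} \prod_{i=1}^{m} \prod_{j=1}^{n_i}
  N(x_{ij} \mid \mu_i, \lambda)\, dQ(\mu_1, \ldots, \mu_m, \lambda).$$

This is the model most often used to describe beliefs about a one-way layout of measurement
data.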


As in the case of 0-1 random quantities with $m = 2$, discussed earlier in
this section, we could make analogous remarks concerning the various qualitative
forms of specification of the prior distribution that might be made in these cases.
We shall not pursue this further here, but will comment further in Section 4.7.5.


4.6.3 Structured Layouts


Let us now consider the situation described in (ii) of Section 4.6.1, where the random
quantity $x_{ijk}$ is triple-subscripted to indicate that it is the $k$th of $K$ “replicates” of
an observable in “context” $i \in I$, subject to “treatment” $j \in J$. In general terms,
we have a two-way layout, having $I$ rows and $J$ columns, with $K$ replicates in each
of the $IJ$ cells.

In such contexts, most individuals would find it unacceptable to make a judgement
of complete exchangeability for the random quantities $x_{ijk}$. For example, if
rows represent age-groups, columns correspond to different drug treatments, replicates
refer to sequences of patients within each age-group/treatment combination
and the $x_{ijk}$ measure death-rates, say, it is typically not the case that beliefs about
the $x_{ijk}$ would be invariant under permutations of the subscript $i$. On the other
hand, for the kinds of mechanisms routinely used to allocate patients to treatment
groups in clinical trials, many individuals would have exchangeable beliefs about
the sequence $x_{ij1}, x_{ij2}, \ldots$ for any fixed $i, j$.

Technically, such a situation corresponds to the invariance of joint beliefs for
the collection of random quantities, $x_{ijk}$, under some restricted set of permutations
of the subscripts, rather than under the unrestricted set of all possible permutations
(which would correspond to complete exchangeability). The precise nature
of the appropriate set of invariances encapsulating beliefs in a particular application
will, of course, depend on the actual perceived partial exchangeabilities in
that application. In what follows, we shall simply motivate, using very minimal
exchangeability assumptions, a model which is widely used in the context of the
two-way layout. There is no suggestion that the particular form discussed has any
special status, or ought to be routinely adopted, or whatever.

Suppose that, for any fixed $i, j$, we think of $x_{ij1}, x_{ij2}, \ldots$ as a (potentially)
infinite sequence of real-valued random quantities ($x \in \Re$), such that the $IJ$ sequences
of this kind, with $I$ and $J$ fixed, are judged to be unrestrictedly exchangeable. If
further assumptions of centred spherical symmetry or sufficiency for each sequence
then lead to the normal form of representation, we have

$$p(x_{11}(n_{11}), \ldots, x_{IJ}(n_{IJ})) = \int_{\Theta} \prod_{i=1}^{I} \prod_{j=1}^{J} \prod_{k=1}^{n_{ij}}
  N(x_{ijk} \mid \mu_{ij}, \lambda_{ij})\, dQ(\theta),$$

where $\theta = (\mu_{11}, \ldots, \mu_{IJ}, \lambda_{11}, \ldots, \lambda_{IJ})$ and $\Theta = \Re^{IJ} \times (\Re^+)^{IJ}$,
so that conditional, for each $(i, j)$, on the strong law limits

$$\mu_{ij} = \lim_{K \to \infty} K^{-1}(x_{ij1} + \cdots + x_{ijK}) = \lim_{K \to \infty} \bar{x}_{ij}(K),$$

$$(\lambda_{ij})^{-1} = \lim_{K \to \infty} K^{-1} \sum_{k=1}^{K} (x_{ijk} - \bar{x}_{ij}(K))^2 = \lim_{K \to \infty} s_{ij}^2(K),$$

the $x_{ijk}$ are assumed independently and normally distributed with means $\mu_{ij}$ and
variances $(\lambda_{ij})^{-1}$.

In many cases, the nature of the observational process leads to the judgement
that $\lim_{K \to \infty} s_{ij}^2(K)$ may be assumed to be the same for all $(i, j)$, so that
$\lambda_{ij} = \lambda$, say, for all $i, j$. Letting

$$\mu_{i\cdot} = \lim_{K \to \infty} \frac{1}{J} \sum_{j=1}^{J} \bar{x}_{ij}(K) = \frac{1}{J} \sum_{j=1}^{J} \mu_{ij},$$

$$\mu_{\cdot j} = \lim_{K \to \infty} \frac{1}{I} \sum_{i=1}^{I} \bar{x}_{ij}(K) = \frac{1}{I} \sum_{i=1}^{I} \mu_{ij},$$

$$\mu = \lim_{K \to \infty} \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \bar{x}_{ij}(K) = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \mu_{ij},$$

denote the strong law limits of the row averages, column averages and overall
average, respectively, from the two-way layout with $I$ and $J$ fixed, we can always
write

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij},$$

where

$$\alpha_i = (\mu_{i\cdot} - \mu), \qquad \beta_j = (\mu_{\cdot j} - \mu), \qquad
  \gamma_{ij} = (\mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu),$$

so that the random quantities $x_{ijk}$ are conditionally independently distributed with

$$p(x_{ijk} \mid \mu, \alpha_i, \beta_j, \gamma_{ij}, \lambda) = N(x_{ijk} \mid \mu + \alpha_i + \beta_j + \gamma_{ij}, \lambda).$$


The full model representation is then completed by the specification of a prior
distribution $Q$ for $\lambda$ and any $IJ$ linearly independent combinations of the $\mu_{ij}$.
In conventional terminology, $\mu$ is referred to as the overall mean, $\alpha_i$ as the $i$th
row effect, $\beta_j$ as the $j$th column effect and $\gamma_{ij}$ as the $(ij)$th interaction effect.
Collectively, the $\{\alpha_i\}$ and $\{\beta_j\}$ are referred to as the main effects and $\{\gamma_{ij}\}$ as the
interactions. Interest in applications often centres on whether or not interactions or
main effects are close to zero and, if not, on making inferences about the magnitudes
of differences between different row or column effects.
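
The decomposition into overall mean, main effects and interactions is a purely algebraic identity and can be sketched directly. The table of cell means below is hypothetical, as is the function name `decompose`.

```python
def decompose(mu):
    """Split an I x J table of cell means mu[i][j] into the overall mean,
    row effects alpha, column effects beta and interactions gamma."""
    I, J = len(mu), len(mu[0])
    overall = sum(sum(row) for row in mu) / (I * J)
    row_means = [sum(row) / J for row in mu]
    col_means = [sum(mu[i][j] for i in range(I)) / I for j in range(J)]
    alpha = [r - overall for r in row_means]
    beta = [c - overall for c in col_means]
    gamma = [[mu[i][j] - row_means[i] - col_means[j] + overall
              for j in range(J)] for i in range(I)]
    return overall, alpha, beta, gamma

mu = [[10.0, 12.0, 17.0],
      [8.0, 9.0, 16.0]]                 # hypothetical cell means, I = 2, J = 3
overall, alpha, beta, gamma = decompose(mu)
print(overall, alpha, beta, gamma)
```

By construction the row effects, the column effects and each row and column of the interactions sum to zero, which is why only $IJ$ linearly independent combinations of the $\mu_{ij}$ need a prior.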

In the above discussion, our exchangeability assumptions were restricted to
the sequence $x_{ij1}, x_{ij2}, \ldots$ for fixed $i, j$. It is possible, of course, that further
forms of symmetric beliefs might be judged reasonable for certain permutations of
the $i, j$ subscripts. We shall return to this possibility in Section 4.6.5, where we
shall see that certain further assumptions of invariance lead naturally to the idea of
hierarchical representations of beliefs.


4.6.4 Covariates


In (iii) of Section 4.6.1, we gave examples of situations where beliefs about sequences
of observables $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, are functionally dependent, in some
sense, on the observed values, $z_i$, $i = 1, \ldots, m$, of a related sequence of (random)
quantities. We shall refer to the latter as covariates and, in recognition of this
dependency, we shall denote the joint density of $x_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, m$,
by

$$p(x_1(n_1), \ldots, x_m(n_m) \mid z_1, \ldots, z_m).$$


The examples which follow illustrate some of the typical forms assumed in applications.
Again, there is no suggestion that these particular forms have any special
status; they simply illustrate some of the kinds of models which are commonly
used.


Example 4.12. (Bioassay). Suppose that at each of $m$ specified dose levels, $z_1, \ldots, z_m$,
of a toxic substance, typically measured on a logarithmic scale, sequences of 0-1
random quantities, $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, are to be observed, where $x_{ij} = 1$ if the
$j$th animal receiving dose $z_i$ survives, $x_{ij} = 0$ otherwise. If, for each $i = 1, \ldots, m$, the
sequences $x_{i1}, x_{i2}, \ldots$ are judged exchangeable, and if we denote the number of survivors
out of $n_i$ animals observed in the $i$th sequence by $y_i(n_i) = x_{i1} + \cdots + x_{in_i}$, a
straightforward generalisation of the corollary to Proposition 4.18 implies a representation
of the form

$$p(y_1(n_1), \ldots, y_m(n_m) \mid z) = \int_{[0,1]^m} \prod_{i=1}^{m}
  \mathrm{Bi}(y_i(n_i) \mid \theta_i(z), n_i)\, dQ(\theta(z)),$$

where $z = (z_1, \ldots, z_m)$, $\theta(z) = (\theta_1(z), \ldots, \theta_m(z))$ and
$\theta_i(z) = \lim_{n \to \infty} n^{-1} y_i(n)$.

In many situations, investigators often find it reasonable to assume that

$$\theta_i(z) = \theta(z_i) = G(\phi; z_i),$$

where the functional form $G$ (usually monotone increasing from 0 to 1) is specified, but $\phi$ is
a random quantity. Functions having the form $G(\phi; z_i) = G(\phi_1 + \phi_2 z_i)$, with
$\phi_1 \in \Re$, $\phi_2 \in \Re^+$, are widely used (see, for example, Hewlett and Plackett, 1979), with

$$G(\phi_1 + \phi_2 z_i) = \int_{-\infty}^{\phi_1 + \phi_2 z_i} N(u \mid 0, 1)\, du \qquad \text{(the probit model)}$$

and

$$G(\phi_1 + \phi_2 z_i) = \exp(\phi_1 + \phi_2 z_i) / \{1 + \exp(\phi_1 + \phi_2 z_i)\} \qquad \text{(the logit model)}$$

being the most common. For any specified $G(\cdot\,; z_i)$, the required representation has the form

$$p(y_1(n_1), \ldots, y_m(n_m) \mid z) = \int_{\Phi} \prod_{i=1}^{m}
  \mathrm{Bi}(y_i(n_i) \mid G(\phi; z_i), n_i)\, dQ^*(\phi),$$

with $dQ^*(\phi)$ specifying a prior distribution for $\phi \in \Phi$. In practice, the specification of
$Q^*$ might be facilitated by reparametrising from $\phi$ to a more suitable (1-1) transformation
$\psi = \psi(\phi)$. In the probit and logit cases, for example, $\psi_1 = -\phi_1/\phi_2$ corresponds to the
(log) dose, $z_i$, at which $G(\phi_1 + \phi_2 z_i) = 1/2$. Beliefs about $\psi_1$ then correspond to beliefs
about the (log-) dose level for which the survival frequency in a large series of animals would
equal 1/2, the so-called LD50 dose. Experimenters might typically be more accustomed to
thinking in terms of $(-\phi_1/\phi_2, \phi_2)$, say, than in terms of $(\phi_1, \phi_2)$.
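
The two response functions and the LD50 reparametrisation can be sketched as follows. The parameter values, dose levels and counts are hypothetical, and the function names are illustrative rather than taken from the text; the likelihood shown is only the conditional (binomial) part of the representation, before mixing over a prior.

```python
import math

def probit_model(phi1, phi2, z):
    """Standard normal distribution function evaluated at phi1 + phi2 * z."""
    return 0.5 * (1.0 + math.erf((phi1 + phi2 * z) / math.sqrt(2.0)))

def logit_model(phi1, phi2, z):
    """Logistic response exp(u) / (1 + exp(u)) with u = phi1 + phi2 * z."""
    return 1.0 / (1.0 + math.exp(-(phi1 + phi2 * z)))

def log_likelihood(G, phi1, phi2, doses, n, y):
    """Sum over dose levels of log Bi(y_i | G(phi; z_i), n_i)."""
    total = 0.0
    for z, ni, yi in zip(doses, n, y):
        theta = G(phi1, phi2, z)
        total += (math.log(math.comb(ni, yi))
                  + yi * math.log(theta) + (ni - yi) * math.log(1.0 - theta))
    return total

phi1, phi2 = -1.2, 0.8            # hypothetical parameter values
ld50 = -phi1 / phi2               # psi_1 = -phi_1 / phi_2

doses = [0.0, 1.0, 2.0, 3.0]      # hypothetical (log) dose levels
n = [10, 10, 10, 10]              # animals per dose
y = [2, 4, 6, 9]                  # hypothetical counts with x_ij = 1
print(ld50, log_likelihood(logit_model, phi1, phi2, doses, n, y))
```

Both response functions return 1/2 at `ld50`, which is the property that gives $\psi_1$ its operational meaning.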


Example 4.13. (Growth-curves). Suppose that at each of $m$ specified time points,
say $z_1, \ldots, z_m$, sequences of real-valued random quantities, $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, are
to be observed, where $x_{ij}$ is the $j$th replicate measurement (perhaps on a logarithmic scale)
of the size or weight of the subject or object of interest at time $z_i$. Suppose further that the
kinds of judgements outlined in Example 4.11 are made about the sequences $x_{i1}, x_{i2}, \ldots$,
with $i = 1, \ldots, m$, so that we have the representation

$$p(x_1(n_1), \ldots, x_m(n_m)) = \int_{\Theta} \prod_{i=1}^{m} \prod_{j=1}^{n_i}
  N(x_{ij} \mid \mu_i(z), \lambda_i(z))\, dQ(\theta(z)),$$

where $\theta(z) = (\mu_1(z), \ldots, \mu_m(z), \lambda_1(z), \ldots, \lambda_m(z))$ and $\Theta = \Re^m \times (\Re^+)^m$.
In many such situations, the judgement is made that $\lambda_1(z) = \cdots = \lambda_m(z) = \lambda$
(particularly if measurements are made on a logarithmic scale) and that

$$\mu_i(z) = \mu(z_i) = g(\phi; z_i),$$

where the functional form $g$ (usually monotone increasing) is specified, but $\phi$ is a random
quantity. Commonly assumed forms include

$$g(\phi; z_i) = (\phi_1 + \phi_2 \phi_3^{z_i})^{-1} \qquad \text{(the logistic model)}$$

and

$$g(\phi; z_i) = \phi_1 + \phi_2 z_i \qquad \text{(the straight-line model)}.$$

For any specified $g(\cdot\,; z_i)$, the joint predictive density representation has the form

$$p(x_1(n_1), \ldots, x_m(n_m)) = \int_{\Phi \times \Re^+} \prod_{i=1}^{m} \prod_{j=1}^{n_i}
  N(x_{ij} \mid g(\phi; z_i), \lambda)\, dQ(\phi, \lambda),$$

where $dQ(\phi, \lambda)$ specifies a prior distribution for $\phi \in \Phi$ and $\lambda \in \Re^+$.
As with Example 4.12, specification of $Q$ might be facilitated if we reparametrise from
$\phi$ to a more suitable (1-1) transformation, $\psi = \psi(\phi)$. In the logistic case, for example, we
might take $\psi_1 = \phi_1^{-1}$, corresponding to the “saturation” growth level reached as $z_i \to \infty$,
and $\psi_2 = (\phi_1 + \phi_2)^{-1}$, corresponding to the growth level at the “time origin”, $z_i = 0$. Beliefs
about $\psi_1$, $\psi_2$ then acquire an operational meaning as beliefs about the average growth-levels,
at times “$\infty$” and “0” respectively, that would be observed from a large number of replicate
measurements. A third possible parameter to which investigators could easily relate in some
applications might be $\psi_3 = \log[\phi_1/(2\phi_1 + \phi_2)]/\log(\phi_3)$, the time at which growth is
half-way from the initial to the final level.
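
The logistic reparametrisation can be checked numerically. In the sketch below the parameter values are hypothetical, and the half-way time is obtained by solving $g(\phi; z) = (\psi_1 + \psi_2)/2$ for $z$, which gives $z = \log[\phi_1/(2\phi_1 + \phi_2)]/\log(\phi_3)$ when $0 < \phi_3 < 1$.

```python
import math

def logistic_growth(phi1, phi2, phi3, z):
    """Logistic growth curve g(phi; z) = 1 / (phi1 + phi2 * phi3 ** z)."""
    return 1.0 / (phi1 + phi2 * phi3 ** z)

phi1, phi2, phi3 = 0.02, 0.18, 0.5     # hypothetical values, 0 < phi3 < 1
psi1 = 1.0 / phi1                      # saturation level as z -> infinity
psi2 = 1.0 / (phi1 + phi2)             # level at the time origin z = 0
psi3 = math.log(phi1 / (2 * phi1 + phi2)) / math.log(phi3)   # half-way time

halfway_level = logistic_growth(phi1, phi2, phi3, psi3)
print(psi1, psi2, psi3, halfway_level)
```

With these values the curve rises from about 5 at the origin towards 50, and passes through the mid-point 27.5 at time `psi3`.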
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Example 4.14. (Multiple regression). Suppose that, for each i = I1..... mM, Se- 
quences of real-valued random quantities r;,.2j2.... are to be observed, where each 1x,; 
is related to certain specified observed quantities z; = (z;;.....2,) and judgements are 
made which lead to the belief representation 


m ey 


ple (ry)... n(n) =f TL [Nii ee). (2) €Q(@(z)). 


7 eed yal 


where 
p(z,) = lim %, (7). A; '(z,) = lim s?(2). 


Hex 


6(z) = (41(21) see Han (Zin)s Ai(Z1). 6. Am ( Zn) 


and 8 = R” x (R')", with z = (z;,..., Sy). 

In many situations, the further judgements are made that \,(z,) = A, = Aand y;(2,) = 
p({z;).i = 1....,m. where A and j(.) are unknown, but the latter is assumed to be a 
“smooth” function, adequately approximated by a first-order Taylor expansion, so that, for 
some (unspecified) z*, 


H(z} = e(2") + (2, - 2) GV (2") = a, 
where we define 
Q; = (1. 21....,2;%) (row vector) 


and 
0 = (0,9)..... ,)' (column vector) 


with 
% = wz") — 2° 9 u(z*).9 = [Vu(z*)jni = 1... ke 
Conditional on ¢ = (0, 4), the joint distribution of 
r= (a, (71), cee -Zin(Thn)) 
is thus seen to be multivariate normal, N,,(2 | A@.A), where A is an n x k matrix (n = 
ny +-+++ My), whose rows consist of a, replicated n, times, followed by a, replicated no 


times, and soon, and A = AZ,,, with I, denoting the 7 x 7 identity matrix. The unconditional 
representation can therefore be written as 


pla) = boa.. N, (a | A@. X) dQ(0. A). 


It is conventional to refer to =,,. 22,... .as values of the regressor variables z')),j =1..... k, 
to @ as the vector of regression coefficients and A as the design matrix. The form (z) = A@ 
is called a regression equation and the structure 


E(a|A.@.) = A@ 
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is said to define a linear model. If k = 1, we have the simple regression (straight-line) 
model, E(x,,) = 0) + 912: fork > 2, we have a multiple regression model. 

From an operational point of view, beliefs about $\theta$ in the general case relate to beliefs about the intercept ($\theta_0$) of the regression equation and the marginal rates of change ($\theta_1, \ldots, \theta_k$) of the $\mu_i$ with respect to the regressor variables ($z^{(1)}, \ldots, z^{(k)}$). However, within this general structure we can represent various special cases such as $z^{(j)} = z^j$ (polynomial regression) or $z^{(j)} = \sin(j\pi z/N)$, for some $N$ (a version of trigonometric regression). In these cases, beliefs about $\theta$ will stem from rather different considerations.
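The construction of the design matrix in the regression example can be sketched in a few lines of Python. This is only an illustration under assumed values: the covariate values, replication counts, coefficients and precision below are hypothetical, and the function names are not taken from the text.

```python
import random

def design_matrix(z_rows, n_reps):
    """Stack the rows a_i = (1, z_i1, ..., z_ik), each replicated n_i times."""
    rows = []
    for z_i, n_i in zip(z_rows, n_reps):
        a_i = [1.0] + [float(v) for v in z_i]
        rows.extend(list(a_i) for _ in range(n_i))
    return rows

def regression_mean(A, theta):
    """Return the vector A theta, i.e. E(x | A, theta, lambda)."""
    return [sum(a * t for a, t in zip(row, theta)) for row in A]

def simulate(A, theta, lam, rng):
    """Draw x ~ N_n(A theta, precision lam * I_n); lam is a precision."""
    sd = lam ** -0.5
    return [rng.gauss(mean_i, sd) for mean_i in regression_mean(A, theta)]

# Hypothetical layout: k = 1 regressor, m = 3 covariate values.
z_rows = [(0.0,), (1.0,), (2.0,)]
n_reps = [2, 1, 3]
theta = [1.0, 0.5]          # (theta_0, theta_1), assumed for illustration
A = design_matrix(z_rows, n_reps)
mean = regression_mean(A, theta)
x = simulate(A, theta, lam=4.0, rng=random.Random(0))
```

With $k = 1$ the matrix has $k + 1 = 2$ columns and $n = 6$ rows, and `mean` is the straight line $\theta_0 + \theta_1 z_i$ evaluated at each replicated covariate value.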


Specification of the kinds of structures which we have illustrated in Examples 
4.12 to 4.14 essentially reduces to the same process as we have seen in earlier 
representations of joint predictive densities as integral mixtures. We proceed as if: 


(i) the random quantities are conditionally independent, given the values of the
relevant covariates, $z$, and given the unknown parameters, $\theta$;
(ii) the latter are assigned a prior distribution, $dQ(\theta)$.


In many cases, the likelihood, defined through conditional independence, involves
familiar probability models, often of exponential family form (as with the
binomial, normal and multivariate examples seen above), but with at least some
of the usual “labelling” parameters replaced by more complex functional forms
involving the covariates. From a conceptual point of view, this is all that really
needs to be said for the time being. However, when we consider the applications
of such models, together with the problems of computation, approximation, etc.,
which arise in implementing the Bayesian learning process, it is often useful to
have a more structured taxonomy in mind: for example, linear versus non-linear
functional forms; normal versus non-normal distributions, and so on.


4.6.5 Hierarchical Models 


In Section 4.6.2, we considered the general situation where several sequences of
random quantities, $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, are judged unrestrictedly infinitely
exchangeable, leading typically to a joint density representation of the form

$$p(x_1(n_1), \ldots, x_m(n_m)) = \int \prod_{i=1}^{m} \prod_{j=1}^{n_i} p(x_{ij} \,|\, \theta_i)\, dQ(\theta_1, \ldots, \theta_m).$$


We remarked at that time that nothing can be said, in general, about the prior
specification $Q(\theta_1, \ldots, \theta_m)$, since this must reflect whatever beliefs are appropriate for
the specific application being modelled. However, it is often the case that additional
judgements about relationships among the $m$ sequences lead to interestingly
structured forms of $Q(\theta_1, \ldots, \theta_m)$.

In Section 4.6.1, we noted some of the possible contexts in which judgements
of exchangeability might be appropriate not only for the random quantities within
each of $m$ separate sequences of observables, but also between the $m$ strong law
limits of appropriately defined statistics for each of the sequences. The following
examples illustrate this kind of structured judgement and the forms of hierarchical
model which result.


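Example 4.15. (Exchangeable binomial parameters). Suppose that we have $m$ unrestrictedly infinitely exchangeable sequences of 0-1 random quantities, $x_{i1}, x_{i2}, \ldots$, with $i = 1, \ldots, m$. Then, for $i = 1, 2, \ldots, m$, $[n_i, y_i(n_i) = x_{i1} + \cdots + x_{in_i}]$ is a sufficient statistic for the $i$th sequence and

$$p(y_1(n_1), \ldots, y_m(n_m)) = \int_{[0,1]^m} p(y_1(n_1), \ldots, y_m(n_m) \,|\, \theta_1, \ldots, \theta_m)\, dQ(\theta_1, \ldots, \theta_m)$$

$$= \int_{[0,1]^m} \prod_{i=1}^{m} \mathrm{Bi}(y_i(n_i) \,|\, \theta_i, n_i)\, dQ(\theta_1, \ldots, \theta_m),$$

where

$$\theta_i = \lim_{n_i\to\infty} (y_i(n_i)/n_i).$$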


As we remarked in Section 4.6.1, if the sequences consist of success-failure outcomes on repeated trials with $m$ different (but, to all intents and purposes, “similar”) types of component, it might be reasonable to judge the $m$ “long-run success frequencies” to be themselves exchangeable. This corresponds to specifying an exchangeable form of prior distribution for the parameters $\theta_1, \ldots, \theta_m$. If the $m$ types of component can be thought of as a selection from a potentially infinite sequence of similar components, we then have (see Section 4.3.3) the general representation

$$Q(\theta_1, \ldots, \theta_m) = \int_{\mathcal{F}} Q(\theta_1, \ldots, \theta_m \,|\, G)\, d\Pi(G) = \int_{\mathcal{F}} \prod_{i=1}^{m} G(\theta_i)\, d\Pi(G).$$

The complete model structure is then seen to have the hierarchical form

$$p(y_1(n_1), \ldots, y_m(n_m) \,|\, \theta_1, \ldots, \theta_m) = \prod_{i=1}^{m} \mathrm{Bi}(y_i(n_i) \,|\, \theta_i, n_i)$$

$$Q(\theta_1, \ldots, \theta_m \,|\, G) = \prod_{i=1}^{m} G(\theta_i)$$

$$\Pi(G).$$


In conventional terminology, the first stage of the hierarchy relates data to parameters via binomial distributions; the second stage models the binomial parameters as if they were a random sample from a distribution $G$; the third, and final, stage specifies beliefs about $G$.

The above example is readily generalised to the case of exchangeable parameters
for any one-parameter exponential family. In practice, beliefs about $G$ might
concentrate on a particular parametric family, so that, assuming the existence of
the appropriate densities, the prior specification takes the form

$$q(\theta_1, \ldots, \theta_m) = \int_{\Phi} g(\theta_1, \ldots, \theta_m \,|\, \phi)\, d\Pi(\phi) = \int_{\Phi} \prod_{i=1}^{m} g(\theta_i \,|\, \phi)\, d\Pi(\phi)$$

and, for appropriate sufficient statistics $y_i(n_i)$, $i = 1, \ldots, m$, defines the hierarchical
structure

$$p(y_1(n_1), \ldots, y_m(n_m) \,|\, \theta_1, \ldots, \theta_m) = \prod_{i=1}^{m} p(y_i(n_i) \,|\, \theta_i)$$

$$g(\theta_1, \ldots, \theta_m \,|\, \phi) = \prod_{i=1}^{m} g(\theta_i \,|\, \phi)$$

$$\Pi(\phi).$$

As before, the first stage of the hierarchy relates data to parameters in a form
assumed to be independent of $G$; the second stage now models the parameters as if
they were a random sample from a parametric family labelled by the hyperparameter
$\phi \in \Phi$; the third, and final, stage specifies beliefs about the hyperparameter.
Such beliefs acquire operational significance by identifying the hyperparameter
with appropriate strong law limits of observables, as we shall indicate in the
following example.
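As an illustration of the parametric hierarchy just described, the following Python sketch simulates the binomial case under the assumption that the second stage is a Beta family with hyperparameter $\phi = (a, b)$. The Beta choice, the fixed hyperparameter values and the function name are assumptions made here for illustration; the third stage (beliefs about $\phi$) is not modelled.

```python
import random

def simulate_binomial_hierarchy(m, n, a, b, rng):
    """Second stage: theta_i ~ Beta(a, b); first stage: y_i ~ Bi(. | theta_i, n)."""
    thetas = [rng.betavariate(a, b) for _ in range(m)]
    ys = [sum(rng.random() < theta for _ in range(n)) for theta in thetas]
    return thetas, ys

rng = random.Random(1)
n_trials = 50
thetas, ys = simulate_binomial_hierarchy(m=200, n=n_trials, a=2.0, b=6.0, rng=rng)

# Observed success frequencies, one per sequence, and their average.
freqs = [y / n_trials for y in ys]
mean_freq = sum(freqs) / len(freqs)
```

Under these assumptions the average of the observed success frequencies should lie close to the Beta mean $a/(a+b) = 0.25$, which is one way of seeing how a hyperparameter can be tied to a limit of observables.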


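Example 4.16. (Exchangeable normal mean parameters). Suppose that we have $m$ unrestrictedly infinitely exchangeable sequences $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, of real-valued random quantities, for which (see Example 4.11) the joint density has the representation

$$p(x_1(n_1), \ldots, x_m(n_m)) = \int_{\Re^m\times\Re^+} \prod_{i=1}^{m} \prod_{j=1}^{n_i} N(x_{ij} \,|\, \mu_i, \lambda)\, dQ(\mu_1, \ldots, \mu_m, \lambda),$$

where we recall that $\lambda^{-1} = \lim_{n\to\infty} s^2(n)$ and $\mu_i = \lim_{n_i\to\infty} \bar{x}_{n_i}(i)$, where

$$\bar{x}_{n_i}(i) = (x_{i1} + \cdots + x_{in_i})/n_i, \qquad s^2(n) = \frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{n_i}(i))^2, \qquad n = n_1 + \cdots + n_m.$$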


So far as the specification of $Q(\mu_1, \ldots, \mu_m, \lambda)$ is concerned, we first note that in many applications it is helpful to think in terms of

$$Q(\mu_1, \ldots, \mu_m, \lambda) = Q_\mu(\mu_1, \ldots, \mu_m \,|\, \lambda)\, Q_\lambda(\lambda)$$

for some $Q_\mu$, $Q_\lambda$. In some cases, knowledge of the strong law limits of sums of squares about the mean may be judged irrelevant to the assessment of beliefs for strong law limits of the sample averages: in such cases, $Q_\mu(\mu_1, \ldots, \mu_m \,|\, \lambda)$ will not depend on $\lambda$. In other cases, we might believe, for example, that variation among the limiting sample averages is certainly bigger (or certainly smaller) than within-sequence variation of observations about the sample mean: in such cases, $Q_\mu(\mu_1, \ldots, \mu_m \,|\, \lambda)$ will involve $\lambda$. In either case, it is useful to think in terms of the product form of $Q$.

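Now suppose that, conditional on $\lambda$, the limiting sample means are judged exchangeable. If the $m$ sequences can be thought of as a selection from a potentially infinite collection of similar sequences, we have (see Section 4.3) a further representation of $Q_\mu$ in the form

$$Q_\mu(\mu_1, \ldots, \mu_m \,|\, \lambda) = \int_{\mathcal{F}} Q_\mu(\mu_1, \ldots, \mu_m \,|\, \lambda, G)\, d\Pi(G \,|\, \lambda) = \int_{\mathcal{F}} \prod_{i=1}^{m} G(\mu_i \,|\, \lambda)\, d\Pi(G \,|\, \lambda).$$

The complete model then has the hierarchical structure

$$p(x_1(n_1), \ldots, x_m(n_m) \,|\, \mu_1, \ldots, \mu_m, \lambda) = \prod_{i=1}^{m} p(x_i(n_i) \,|\, \mu_i, \lambda)$$

$$Q_\mu(\mu_1, \ldots, \mu_m \,|\, \lambda, G) = \prod_{i=1}^{m} G(\mu_i \,|\, \lambda)$$

$$\Pi(G \,|\, \lambda)\, Q_\lambda(\lambda).$$

In practice, beliefs about $G$, given $\lambda$, might concentrate on a particular parametric family, so that, assuming the existence of the appropriate densities, the hierarchical structure would take the form

$$p(x_1(n_1), \ldots, x_m(n_m) \,|\, \mu_1, \ldots, \mu_m, \lambda) = \prod_{i=1}^{m} p(x_i(n_i) \,|\, \mu_i, \lambda)$$

$$g_\mu(\mu_1, \ldots, \mu_m \,|\, \lambda, \phi) = \prod_{i=1}^{m} g_\mu(\mu_i \,|\, \lambda, \phi)$$

$$\Pi(\phi \,|\, \lambda)\, Q_\lambda(\lambda).$$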
For an explicit example of this, suppose that, given a potentially infinite sequence $\mu_1, \mu_2, \ldots$ (or, more concretely, $\bar{x}_{n_1}(1), \bar{x}_{n_2}(2), \ldots$, for very large $n_1, n_2, \ldots$), the quantities $[m, \bar{\mu}(m) = m^{-1}(\mu_1 + \cdots + \mu_m)$ and $s^2(m) = m^{-1}\sum_{i=1}^{m}(\mu_i - \bar{\mu}(m))^2]$ (or the large sample analogues of $\bar{\mu}(m)$ and $s^2(m)$) were judged sufficient for the sequence. It would then be natural (see Section 4.5) to take $g_\mu(\mu_i \,|\, \lambda, \phi) = N(\mu_i \,|\, \phi_1, \phi_2)$, where

$$\phi_1 = \lim_{m\to\infty} \bar{\mu}(m), \qquad \phi_2^{-1} = \lim_{m\to\infty} s^2(m).$$

From an operational standpoint, the final stage specification of the joint prior distribution for $\phi_1$, $\phi_2$ and $\lambda$ then reduces to a specification of beliefs about the following limits of observable quantities (for large $m$ and $n_i$, $i = 1, \ldots, m$):
(i) the mean of all the observations from all the sequences ($\phi_1$);
(ii) the mean sum of squares of the individual sequence means about the overall mean ($\phi_2$);
(iii) the mean (over sequences) of the mean sum of squares of observations within a sequence
about the sequence mean ($\lambda$).


The precise form of specification at this stage will, of course, depend on the particular
situation in which the model is being applied.
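The link between hyperparameters and limits of observables in the normal example can be explored with a small simulation. The sketch below assumes the parametric form $N(\mu_i \,|\, \phi_1, \phi_2)$ for the second stage, with $\phi_2$ and $\lambda$ treated as precisions, and uses hypothetical hyperparameter values and equal sequence lengths; it computes the three observable summaries listed in (i) to (iii).

```python
import random

def simulate_normal_hierarchy(m, n, phi1, phi2, lam, rng):
    """mu_i ~ N(phi1, precision phi2); x_ij ~ N(mu_i, precision lam)."""
    mus = [rng.gauss(phi1, phi2 ** -0.5) for _ in range(m)]
    return [[rng.gauss(mu, lam ** -0.5) for _ in range(n)] for mu in mus]

def observable_summaries(data):
    """Return (overall mean, between-sequence mean square, within-sequence mean square)."""
    m = len(data)
    seq_means = [sum(seq) / len(seq) for seq in data]
    overall = sum(seq_means) / m
    between = sum((xbar - overall) ** 2 for xbar in seq_means) / m
    within = sum(
        sum((x - xbar) ** 2 for x in seq) / len(seq)
        for seq, xbar in zip(data, seq_means)
    ) / m
    return overall, between, within

rng = random.Random(2)
data = simulate_normal_hierarchy(m=300, n=200, phi1=10.0, phi2=0.25, lam=1.0, rng=rng)
overall, between, within = observable_summaries(data)
```

For large $m$ and $n$, `overall` should be near $\phi_1 = 10$, `between` near $\phi_2^{-1} = 4$, and `within` near $\lambda^{-1} = 1$, so beliefs about the hyperparameters can be read as beliefs about these observable summaries.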


Hierarchical modelling provides a powerful and flexible approach to the rep- 
resentation of beliefs about observables in extended data structures, and is being 
increasingly used in statistical modelling and analysis. This section has merely 
provided a brief introduction to the basic ideas and the way such structures arise 
naturally within a subjectivist, modelling framework. In the context of the Bayesian 
learning process, further brief discussion will be given in Section 5.6.4, where links 
will be made with empirical Bayes ideas. 

An extensive discussion of hierarchical modelling will be given in the volumes
Bayesian Computation and Bayesian Methods. A selection of references to the
literature on inference for hierarchical models will be given in Section 5.6.4.


4.7 PRAGMATIC ASPECTS 
4.7.1 Finite and Infinite Exchangeability


The de Finetti representation theorem for 0-1 random quantities, and the various
extensions we have been considering in this chapter, characterise forms of
$p(x_1, \ldots, x_n)$ for observables $x_1, \ldots, x_n$ assumed to be part of an infinite
exchangeable sequence. However, in general, mathematical representations which
correspond to probabilistic mixing over conditionally independent parametric forms
do not hold for finite exchangeable sequences.

To see this, consider $n = 2$ and finitely exchangeable 0-1 $x_1, x_2$, such that

$$p(x_1 = 0, x_2 = 0) = p(x_1 = 1, x_2 = 1) = 0$$

$$p(x_1 = 1, x_2 = 0) = p(x_1 = 0, x_2 = 1) = \tfrac{1}{2}.$$

If the de Finetti representation held, we would have

$$\int_0^1 \theta^2\, dQ(\theta) = \int_0^1 (1 - \theta)^2\, dQ(\theta) = 0,$$

for some $Q(\theta)$, an impossibility since the latter would have to assign probability
one to both $\theta = 0$ and $\theta = 1$ (Diaconis and Freedman, 1980a).

It appears, therefore, that there is a potential conflict between realistic modelling
(acknowledging the necessarily finite nature of actual exchangeability judgements)
and the use of conventional mathematical representations (derived on the
basis of assumed infinite exchangeability).
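The two-observation counterexample can also be checked numerically through covariances, as in the Python sketch below. The argument used here is a standard one rather than the one given in the text: under any mixture of independent Bernoulli($\theta$) pairs the covariance of $x_1$ and $x_2$ equals the variance of $\theta$ under $Q$, which cannot be negative. The discrete mixing distribution is a hypothetical example.

```python
def covariance(joint):
    """Covariance of (x1, x2) for a joint pmf given as {(x1, x2): probability}."""
    e1 = sum(p * x1 for (x1, _), p in joint.items())
    e2 = sum(p * x2 for (_, x2), p in joint.items())
    e12 = sum(p * x1 * x2 for (x1, x2), p in joint.items())
    return e12 - e1 * e2

def mixture_joint(support, weights):
    """Joint pmf of two 0-1 quantities that are i.i.d. Bernoulli(theta), theta ~ Q."""
    joint = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for theta, w in zip(support, weights):
        for a in (0, 1):
            for b in (0, 1):
                pa = theta if a else 1.0 - theta
                pb = theta if b else 1.0 - theta
                joint[(a, b)] += w * pa * pb
    return joint

# The finitely exchangeable pair from the text.
finite_pair = {(0, 0): 0.0, (1, 1): 0.0, (1, 0): 0.5, (0, 1): 0.5}
cov_finite = covariance(finite_pair)

# A hypothetical two-point mixing distribution Q.
cov_mixture = covariance(mixture_joint([0.2, 0.8], [0.5, 0.5]))
```

The finite pair has covariance $-1/4$, whereas the mixture has covariance equal to the variance of $\theta$ (here 0.09), so no choice of $Q$ can reproduce the finite pair.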

To discuss this problem, let us call an exchangeable sequence, $x_1, \ldots, x_n$, with
$x_i \in X$, N-extendible if it is part of the longer exchangeable sequence $x_1, \ldots, x_N$.
Practical judgements of exchangeability for specific observables $x_1, \ldots, x_n$ are
typically of this kind: the $x_1, \ldots, x_n$ can be considered as part of a larger, but
finite, potential sequence of exchangeable observables. Infinite exchangeability
corresponds to the possibly unrealistic assumption of N-extendibility for all $N > n$.

In general, the assumption of infinite exchangeability implies that the probability
assigned to an event $(x_1, \ldots, x_n) \in E \subseteq X^n$ is of the form

$$P_Q(E) = \int F^n(E)\, dQ(F),$$

for some $Q$. If we denote by $P(E)$ the corresponding probability assigned under
N-extendibility for a specific $N$, a possible measure of the “distortion” introduced
by assuming infinite exchangeability is given by

$$\sup_E |P(E) - P_Q(E)|,$$

where the supremum is taken over all events in the appropriate $\sigma$-field on $X^n$.
Intuitively, one might feel that if $x_1, \ldots, x_n$ is N-extendible for some $N \gg n$, the
“distortion” should be somewhat negligible. This is made precise by the following.


Proposition 4.19. (Finite approximation of infinite exchangeability).
With the preceding notation, there exists $Q$ such that

$$\sup_E |P(E) - P_Q(E)| \leq \frac{f(n)\, n}{N},$$

where $f(n)$ is the number of elements in $X$, if the latter is finite, and $f(n) = (n - 1)$ otherwise.


Proof. See Diaconis and Freedman (1980a) for a rigorous statement and technical
details.


The message is clear and somewhat comforting. If a realistic judgement of 
N-extendibility for large, but finite, N is replaced by the mathematically conve- 
nient assumption of infinite exchangeability, no important distortion will occur in 
quantifying uncertainties. 
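A simple numerical illustration of this message is given by an urn. Drawing $n$ balls without replacement from an urn of $N$ balls gives an N-extendible 0-1 sequence, while drawing with replacement corresponds to the infinitely exchangeable idealisation with $Q$ concentrated on the urn proportion. The Python sketch below computes the exact distortion for this case and compares it with a bound of the form $f(n)\,n/N$ with $f(n) = 2$; the form of the bound, the urn sizes and the function names are assumptions made for this illustration.

```python
from math import comb

def hypergeometric_pmf(k, N, K, n):
    """P(k ones in n draws without replacement from N balls, K of them ones)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binomial_pmf(k, n, theta):
    """P(k ones in n independent draws with success probability theta)."""
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def sup_distance(N, K, n):
    """sup_E |P(E) - P_Q(E)| over events for the n draws.

    Both laws are exchangeable, so sequences with the same number of ones are
    equally likely under each, and the supremum equals half the L1 distance
    between the two distributions of the count.
    """
    theta = K / N
    return 0.5 * sum(
        abs(hypergeometric_pmf(k, N, K, n) - binomial_pmf(k, n, theta))
        for k in range(n + 1)
    )

n = 5
d_small_urn = sup_distance(N=100, K=30, n=n)
d_large_urn = sup_distance(N=1000, K=300, n=n)
bound_small_urn = 2 * n / 100
```

In this example the computed distortion is well inside the bound and shrinks as $N$ grows with $n$ fixed.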

For further discussion, see Diaconis (1977), Jaynes (1986) and Hill (1992).
For extensions of Proposition 4.19 to multivariate and linear model structures, see
Diaconis et al. (1992).


4.7.2 Parametric and Nonparametric Models


In Section 4.3, we saw that the assumption of exchangeability for a sequence
$x_1, x_2, \ldots$ of real-valued random quantities implied a general representation of the
joint distribution function of $x_1, \ldots, x_n$ of the form

$$P(x_1, \ldots, x_n) = \int_{\mathcal{F}} \prod_{i=1}^{n} F(x_i)\, dQ(F),$$

where

$$Q(F) = \lim_{n\to\infty} P(F_n)$$

and $F_n$ is the empirical distribution function defined by $x_1, \ldots, x_n$. This implies
that we should proceed as if we have a random sample from an unknown distribution
function $F$, with $Q$ representing our beliefs about “what the empirical distribution
would look like for large $n$”.

As we remarked at the end of Section 4.3.3, the task of assessing and representing
such a belief distribution $Q$ over the set $\mathcal{F}$ of all possible distribution functions
is by no means straightforward, since $F$ is, effectively, an infinite-dimensional
parameter. Most of this chapter has therefore been devoted to exploring additional
features of beliefs which justify the restriction of $\mathcal{F}$ to families of distributions
having explicit mathematical forms involving only a finite-dimensional labelling
parameter.

Conventionally, albeit somewhat paradoxically, representations in the finite-dimensional
case are referred to as parametric models, whereas those involving the
infinite-dimensional parameter are referred to as nonparametric models! The technical
key to Bayesian nonparametric modelling is thus seen to be the specification
of appropriate probability measures over function spaces, rather than over
finite-dimensional real spaces, as in the parametric case. For this reason, the Bayesian
analysis of nonparametric models requires considerably more mathematical machinery
than the corresponding analysis of parametric models. In the rest of this
volume we will deal exclusively with the parametric case, postponing a treatment
of nonparametric problems to the volumes Bayesian Computation and Bayesian
Methods.

Among important references on this topic, we note Whittle (1958), Hill (1968,
1988, 1992), Dickey (1969), Kimeldorf and Wahba (1970), Good and Gaskins
(1971, 1980), Ferguson (1973, 1974), Leonard (1973), Antoniak (1974), Doksum
(1974), Susarla and van Ryzin (1976), Ferguson and Phadia (1979), Dalal and Hall
(1980), Dykstra and Laud (1981), Padgett and Wei (1981), Rolin (1983), Lo (1984),
Thorburn (1986), Kestemont (1987), Berliner and Hill (1988), Wahba (1988), Hjort
(1990), Lenk (1991) and Lavine (1992a).

As we have seen, the use of specific parametric forms can often be given a
formal motivation or justification as the coherent representation of certain forms of
belief characterised by invariance or sufficiency properties. In practice, of course,
there are often less formal, more pragmatic, reasons for choosing to work with a
particular parametric model (as there often are for acting, formally, as if particular
forms of summary statistic were sufficient!). In particular, specific parametric models
are often suggested by exploratory data analysis (typically involving graphical
techniques to identify plausible distributional shapes and forms of relationship with
covariates), or by experience (i.e., historical reference to “similar” situations, where
a given model seemed “to work”) or by scientific theory (which determines that
a specific mathematical relationship “must” hold, in accordance with an assumed
“law”). In each case, of course, the choice involves subjective judgements; for
example, regarding such things as the “straightness” of a graphical normal plot, the
“similarity” between a current and a previous trial, and the “applicability” of a theory
to the situation under study. From the standpoint of the general representation theorem,
such judgements correspond to acting as if one has a $Q$ which concentrates
on a subset of $\mathcal{F}$ defined in terms of a finite-dimensional labelling parameter.


4.7.3 Model Elaboration


However, in arriving at a particular parametric model specification, by means of 
whatever combination of formal and pragmatic judgements have been deemed 
appropriate, a number of simplifying assumptions will necessarily have been made 
(either consciously or unconsciously). It would always be prudent, therefore, to 
“expand one’s consciousness” a little in relation to an intended model in order 
to review the judgements that have been made. Depending on the context, the 
following kinds of critical questions might be appropriate: 


(i) is it reasonable to assume that all the observables form a “homogeneous sam- 
ple”, or might a few of them be “aberrant” in some sense? 


(ii) is it reasonable to apply the modelling assumptions to the observables on 
their original scale of measurement, or should the scale be transformed to 
logarithms, reciprocals, or whatever? 


(iii) when considering temporally or spatially related observables, is it reasonable 
to have made a particular conditional independence assumption, or should 
some form of correlation be taken into account? 

(iv) if some, but not all, potential covariates have been included in the model, is it
reasonable to have excluded the others, or might some of them be important,
either individually or in conjunction with covariates already included?


We shall consider each of these possibilities in turn, indicating briefly the
kinds of elaboration of the “first thought of” model that might be considered.

Outlier elaboration. Suppose that judgements about a sequence $x_1, x_2, \ldots$ of
real-valued random quantities have led to serious consideration of the model

$$p(x_1, \ldots, x_n) = \int_{\Re\times\Re^+} \prod_{i=1}^{n} N(x_i \,|\, \mu, \lambda)\, dQ(\mu, \lambda),$$

but, on reflection, it is thought wise to allow for the fact that (an unknown) one of
$x_1, \ldots, x_n$ might be aberrant. If aberrant observations are assumed to be such that
a sequence of them would have a limiting mean equal to $\mu$, but a limiting mean
square about the mean equal to $(\gamma\lambda)^{-1}$, $0 < \gamma < 1$, where $\mu$ and $\lambda^{-1}$ denote the
corresponding limits for non-aberrant observations, a suitable form of elaborated
model might be

$$p(x_1, \ldots, x_n) = \pi \int_{\Re\times\Re^+} \prod_{i=1}^{n} N(x_i \,|\, \mu, \lambda)\, dQ(\mu, \lambda)$$

$$\qquad + (1 - \pi) \int_{\Re\times\Re^+\times(0,1)} \frac{1}{n} \sum_{j=1}^{n} N(x_j \,|\, \mu, \gamma\lambda) \Big[ \prod_{i \neq j} N(x_i \,|\, \mu, \lambda) \Big]\, dQ(\mu, \lambda)\, dQ(\gamma).$$
mala al if) 


This model corresponds to an initial assumption that, with specified probability $\pi$,
there are no aberrant observations, but, with probability $1 - \pi$, there is precisely one
aberrant observation, which is equally likely to be any one of $x_1, \ldots, x_n$.
Generalisations to cover more than one possible aberrant observation can be constructed in
an obviously analogous manner. Such models are usually referred to as “outlier”
models, since $\gamma < 1$ implies an increased probability that, in the observed sample
$x_1, \ldots, x_n$, the aberrant observation will “outlie”. Since, for an aberrant observation
$x$, $E[(x - \mu)^2 \,|\, \mu, \lambda, \gamma] = (\gamma\lambda)^{-1}$, prior belief in the relative inaccuracy of an
aberrant observation as a “predictor” of $\mu$ is reflected in the weight attached by the
prior distribution $Q(\gamma)$ to values of $\gamma$ much smaller than 1.
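For fixed values of $\mu$, $\lambda$ and $\gamma$, the integrand of the elaborated outlier model is a two-component mixture that is easy to evaluate directly. The Python sketch below does this; the data values and parameter settings are hypothetical, the argument `weight` plays the role of $\pi$, and no integration over the prior is attempted.

```python
from math import exp, pi, prod, sqrt

def normal_pdf(x, mu, lam):
    """Normal density with mean mu and precision lam."""
    return sqrt(lam / (2 * pi)) * exp(-0.5 * lam * (x - mu) ** 2)

def outlier_likelihood(xs, mu, lam, gamma, weight):
    """weight * (no aberrant observation) + (1 - weight) * (exactly one, equally likely)."""
    base = [normal_pdf(x, mu, lam) for x in xs]
    no_outlier = prod(base)
    n = len(xs)
    one_outlier = 0.0
    for j, x_j in enumerate(xs):
        others = prod(b for i, b in enumerate(base) if i != j)
        one_outlier += normal_pdf(x_j, mu, gamma * lam) * others / n
    return weight * no_outlier + (1 - weight) * one_outlier

xs = [0.1, -0.3, 4.0]   # hypothetical sample with one suspicious value
plain = prod(normal_pdf(x, 0.0, 1.0) for x in xs)
same_as_plain = outlier_likelihood(xs, mu=0.0, lam=1.0, gamma=1.0, weight=0.5)
elaborated = outlier_likelihood(xs, mu=0.0, lam=1.0, gamma=0.05, weight=0.5)
```

Setting $\gamma = 1$ recovers the original model, while a small $\gamma$ gives a much larger likelihood to a sample containing a value far from $\mu$.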

De Finetti (1961) and Box and Tiao (1968) are pioneering Bayesian papers on
this topic. More recent literature includes: Dawid (1973), O’Hagan (1979, 1988b,
1990), Freeman (1980), Smith (1983), West (1984, 1985), Pettit and Smith (1985),
Arnaiz and Ruiz-Rivas (1986), Muirhead (1986), Pettit (1986, 1992), Guttman and
Peña (1988) and Peña and Guttman (1993).


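Transformation elaboration. Suppose now that judgements about a sequence
$x_1, x_2, \ldots$ of real-valued random quantities are such that it seems reasonable to
suppose that, if a suitable $\gamma$ were identified, beliefs about the sequence $x_1^{(\gamma)}, x_2^{(\gamma)}, \ldots$,
defined by

$$x_i^{(\gamma)} = (x_i^{\gamma} - 1)/\gamma \quad (\gamma \neq 0)$$

$$\phantom{x_i^{(\gamma)}} = \log(x_i) \quad (\gamma = 0),$$

would plausibly have the representation

$$p(x_1^{(\gamma)}, \ldots, x_n^{(\gamma)}) = \int_{\Re\times\Re^+} \prod_{i=1}^{n} N(x_i^{(\gamma)} \,|\, \mu, \lambda)\, dQ^*(\mu, \lambda \,|\, \gamma).$$

It then follows that

$$p(x_1, \ldots, x_n) = \int_{\Re\times\Re^+\times\Gamma} \prod_{i=1}^{n} N(x_i^{(\gamma)} \,|\, \mu, \lambda)\, J(x, \gamma)\, dQ^*(\mu, \lambda \,|\, \gamma)\, dQ^*(\gamma),$$

where

$$J(x, \gamma) = \prod_{i=1}^{n} \left| \frac{dx_i^{(\gamma)}}{dx_i} \right|.$$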
i=l 


The case $\gamma = 1$ corresponds to assuming a normal parametric model for the
observations on their original scale of measurement. If $\Gamma$ includes values such as
$\gamma = -1$, $\gamma = 1/2$, $\gamma = 0$, the elaborated model admits the possibility that
transformations such as reciprocal, square root, or logarithm, might provide a better
scale on which to assume a normal parametric model. Judgements about the relative
plausibilities of these and other possible transformations are then incorporated
in $Q^*$. For detailed developments see Box and Cox (1964), Pericchi (1981) and
Sweeting (1984, 1985).
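The power transformation and its Jacobian are straightforward to compute. The Python sketch below assumes the usual Box and Cox form $(x^{\gamma} - 1)/\gamma$ with the logarithm at $\gamma = 0$, and positive observations; the function names are chosen here for illustration. Since the derivative of the transform with respect to $x$ is $x^{\gamma - 1}$, the log Jacobian is $(\gamma - 1)\sum_i \log x_i$.

```python
from math import log

def power_transform(x, gamma):
    """Box-Cox style transform of a positive value x."""
    if gamma == 0:
        return log(x)
    return (x ** gamma - 1) / gamma

def log_jacobian(xs, gamma):
    """log J(x, gamma) = (gamma - 1) * sum(log x_i) for positive x_i."""
    return (gamma - 1) * sum(log(x) for x in xs)

original_scale = power_transform(3.0, 1)      # shifts by one: 3 -> 2
near_log = power_transform(2.0, 1e-6)         # close to log(2)
reciprocal = power_transform(4.0, -1)         # 1 - 1/4
no_change = log_jacobian([1.5, 2.0], 1)       # Jacobian is 1 on the original scale
```

The value at a very small $\gamma$ is close to the logarithm, which illustrates why the $\gamma = 0$ case is defined as the limit of the power family.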


Correlation elaboration. Suppose that judgements about $x_1, x_2, \ldots$ again lead
to a “first thought of” model in which

$$p(x_1, \ldots, x_n \,|\, \mu, \lambda) = \prod_{i=1}^{n} N(x_i \,|\, \mu, \lambda),$$

but that it is then recognised that there may be a serial correlation structure among
$x_1, \ldots, x_n$ (since, for example, the observations correspond to successive
time-points, $t = 1$, $t = 2$, etc.). A possible extension of the representation to incorporate
such correlation might be to assume that, for a given $\gamma \in (-1, 1)$, and conditional on
$\mu$ and $\lambda$, the correlation between $x_i$ and $x_{i+h}$ is given by $R(x_i, x_{i+h} \,|\, \mu, \lambda, \gamma) = \gamma^h$,
so that

$$p(x \,|\, \mu, \lambda, \gamma) = p(x_1, \ldots, x_n \,|\, \mu, \lambda, \gamma) = N_n(x \,|\, \mu 1, \lambda\Gamma^{-1}),$$

where $1$ denotes the $n$-vector of ones and $\Gamma$ the $n \times n$ matrix with $(i, j)$th element $\gamma^{|i-j|}$.

The elaborated model then becomes, for some $Q^*(\mu, \lambda \,|\, \gamma)$ and $Q^*(\gamma)$,

$$p(x_1, \ldots, x_n) = \int_{\Re\times\Re^+\times(-1,1)} N_n(x \,|\, \mu 1, \lambda\Gamma^{-1})\, dQ^*(\mu, \lambda \,|\, \gamma)\, dQ^*(\gamma).$$

The “first thought of” model corresponds to $\gamma = 0$ and beliefs about the relative
plausibility of this value compared with other possible values of positive or negative
correlation are reflected in the specification of $Q^*$.
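The serial correlation structure can be made concrete by building the correlation matrix explicitly. The Python sketch below assumes that the matrix has $(i, j)$th element $\gamma^{|i-j|}$, as implied by the stated correlation $\gamma^h$ at lag $h$, and that the covariance matrix is this matrix divided by the precision $\lambda$; the function names are chosen here for illustration.

```python
def serial_correlation_matrix(n, gamma):
    """n x n matrix whose (i, j)th element is gamma ** |i - j|."""
    return [[gamma ** abs(i - j) for j in range(n)] for i in range(n)]

def covariance_matrix(n, gamma, lam):
    """Covariance implied by precision lam and lag-h correlation gamma ** h."""
    return [[entry / lam for entry in row] for row in serial_correlation_matrix(n, gamma)]

identity_case = serial_correlation_matrix(3, 0.0)   # gamma = 0: independent observations
correlated = serial_correlation_matrix(3, 0.5)
cov = covariance_matrix(3, 0.5, lam=4.0)
```

With $\gamma = 0$ the matrix is the identity, which is the sense in which the elaborated model reduces to the “first thought of” model.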


Covariate elaboration. Suppose that the “first thought of” model for the
observables $x = (x_1(n_1), \ldots, x_m(n_m))$, where $x_i(n_i) = (x_{i1}, \ldots, x_{in_i})$ denotes
replicate observations corresponding to the observed value $z_i = (z_{i1}, \ldots, z_{ik})$ of
the covariates $z^{(1)}, \ldots, z^{(k)}$, is the multiple regression model with representation

$$p(x) = \int_{\Re^{k+1}\times\Re^+} N_n(x \,|\, A\theta, \Lambda)\, dQ(\theta, \lambda)$$

as described in Example 4.14 of Section 4.6. If it is subsequently thought that
covariates $z^{(k+1)}, \ldots, z^{(l)}$ should also have been taken into account, a suitable
elaboration might take the form of an extended regression model

$$p(x) = \int_{\Re^{l+1}\times\Re^+} N_n(x \,|\, A\theta + B\gamma, \Lambda)\, dQ^*(\theta, \gamma, \lambda),$$

where $B$ consists of rows containing $b_i = (z_{i,k+1}, \ldots, z_{il})$ replicated $n_i$ times,
$i = 1, \ldots, m$, and $\gamma = (\theta_{k+1}, \ldots, \theta_l)$ denotes the regression coefficients of the
additional regressor variables $z^{(k+1)}, \ldots, z^{(l)}$. The value $\gamma = 0$ corresponds to the
“first thought of” model.

In all these cases, an initially considered representation of the form

$$p(x) = \int p(x \,|\, \phi)\, dQ(\phi)$$

is replaced by an elaborated representation

$$p(x) = \int p(x \,|\, \phi, \gamma)\, dQ^*(\phi, \gamma),$$

the latter reducing to the original representation on setting the elaboration parameter
$\gamma$ equal to 0. Inference about such a $\gamma$, imaginatively chosen to reflect interesting
possible forms of departure from the original model, often provides a natural basis
for checking on the adequacy of an initially proposed model, as well as learning
about the directions in which the model needs extending.

Other Bayesian approaches to the problem of covariate selection include
Bernardo and Bermúdez (1985), Mitchell and Beauchamp (1988) and George and
McCulloch (1993a).


4.7.4 Model Simplification


The process of model elaboration, outlined in the previous section, consists in 
expanding a “first thought of” model to include additional parameters (and possibly 
covariates), reflecting features of the situation whose omission from the original 
model formulation is, on reflection, thought to be possibly injudicious. 

The process of model simplification is, in a sense, the converse. In reviewing
a currently proposed model, we might wonder whether some parameters (or
covariates) have been unnecessarily included, in the sense that a simpler form of
model might be perfectly adequate. As it stands, of course, this latter consideration
is somewhat ill-defined: the “adequacy”, or otherwise, of a particular form
of belief representation can only be judged in relation to the consequences arising
from actions taken on the basis of such beliefs. These and other questions relating
to the fundamentally important area of model comparison and model choice will
be considered at length in Chapter 6. For the present, it will suffice just to give
an indication of some particular forms of model simplification that are routinely
considered.


Equality of parameters. In Section 4.6, we analysed the situation where several
sequences of observables are judged unrestrictedly infinitely exchangeable,
leading to a general representation of the form

$$p(x_1(n_1), \ldots, x_m(n_m)) = \int_{\Theta^*} \prod_{i=1}^{m} \prod_{j=1}^{n_i} p(x_{ij} \,|\, \theta_i)\, dQ(\theta_1, \ldots, \theta_m),$$

where $\theta_i \in \Theta_i$, $\Theta^* = \prod_{i=1}^{m} \Theta_i$ and the parameter $\theta_i$ relating to the $i$th sequence
can typically be interpreted as the limit of a suitable summary statistic for the $i$th
sequence. If, on the other hand, the simplifying judgement were made that, in fact,
the labelling of the sequences is irrelevant and that any combined collection of
observables from any or all of the sequences would be completely exchangeable,
we would have the representation

$$p(x_1(n_1), \ldots, x_m(n_m)) = \int_{\Theta} \prod_{i=1}^{m} \prod_{j=1}^{n_i} p(x_{ij} \,|\, \theta)\, dQ(\theta),$$

where the same parameter $\theta \in \Theta$ now suffices to label the parametric model for
each of the sequences. In conventional terminology, the simplified representation
is sometimes referred to as the null hypothesis ($\theta_1 = \cdots = \theta_m$) and the original
representation as the alternative hypothesis ($\theta_1 \neq \cdots \neq \theta_m$). As we saw in
Section 4.6 (for the case of two 0-1 sequences), rather than opt for sure for one or other
of these representations, we could take a mixture of the two (with weight $\pi$, say, on
the null representation and $1 - \pi$ on the alternative, general, representation). This
form of representation will be considered in more detail in Chapter 6, where it will
be shown to provide a possible basis for evaluating the relative plausibility of the
“null and alternative hypotheses” in the light of data.
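For two 0-1 sequences, the effect of data on the mixture weight can be sketched in closed form if uniform prior distributions are assumed. In the Python sketch below, the null representation uses a single $\theta$ with a uniform prior and the alternative uses independent uniform priors for $\theta_1$ and $\theta_2$; these prior choices, the data and the function names are assumptions made purely for illustration and are not prescribed by the text.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def posterior_null_weight(y1, n1, y2, n2, prior_weight=0.5):
    """Posterior weight on the null representation (theta_1 = theta_2).

    y_i successes in n_i trials for sequence i; prior_weight is the weight
    placed on the null representation before seeing the data.
    """
    # Null: one theta, uniform prior, pooled sequence of n1 + n2 observations.
    log_m0 = log_beta(y1 + y2 + 1, n1 + n2 - y1 - y2 + 1)
    # Alternative: independent uniform priors for theta_1 and theta_2.
    log_m1 = log_beta(y1 + 1, n1 - y1 + 1) + log_beta(y2 + 1, n2 - y2 + 1)
    w0 = prior_weight * exp(log_m0)
    w1 = (1 - prior_weight) * exp(log_m1)
    return w0 / (w0 + w1)

similar = posterior_null_weight(5, 10, 5, 10)      # matching success frequencies
different = posterior_null_weight(0, 10, 10, 10)   # opposite extremes
```

When the two observed frequencies agree, the weight on the null representation rises above its prior value; when they are at opposite extremes, it falls close to zero.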


Absence of effects. In Section 4.6, we considered the situation of a structured
layout with replicate sequences of observations in each of $IJ$ cells, and a possible
parametric model representation involving row effects ($\alpha_1, \ldots, \alpha_I$), column effects
($\beta_1, \ldots, \beta_J$) and interaction effects ($\gamma_{11}, \ldots, \gamma_{IJ}$). A commonly considered
simplifying assumption is that there are no interaction effects ($\gamma_{11} = \cdots = \gamma_{IJ} = 0$),
so that large sample means in individual cells are just the additive combination of
the corresponding large sample row and column means.

Further possible simplifying judgements would be that the row (or column)
labelling is irrelevant, so that $\alpha_1 = \cdots = \alpha_I = 0$ (or $\beta_1 = \cdots = \beta_J = 0$) and
large sample cell means coincide with column (or row) means. Again, conventional
terminology would refer to these simplifying judgements as “null hypotheses”.


Omission of covariates. Considering, for example, the multiple regression
case, described in Example 4.14 of Section 4.6 and reconsidered in the previous
section on model elaboration, we see that here the simplification process is very
clearly just the converse of the elaboration process. If $\gamma$ denotes the regression
coefficients of the covariates we are considering omitting, then the model corresponding
to $\gamma = 0$ provides the required simplification.

In fact, in all the cases of elaboration which we considered in the previous
section, setting the “elaboration parameter” $\gamma$ to 0 provides a natural form of
simplification of potential interest. Whether the process of model comparison and choice
is seen as one of elaboration or of simplification is then very much a pragmatic issue
of whether we begin with a “smaller” model and consider making it “bigger”, or
vice versa. In any case, issues of model comparison and choice require a separate
detailed and extensive treatment, which we defer until Chapter 6.


4.7.5 Prior Distributions 


The operational subjectivist approach to modelling views predictive models as rep- 
resentations of beliefs about observables (including limiting, large-sample func- 
tions of observables, conventionally referred to as parameters). Invariance and 
sufficiency considerations have then been shown to justify a structured approach to 
predictive models in terms of integral mixtures of parametric models with respect 
to distributions for the labelling parameters. In familiar terminology, we specify
a distribution for the observables conditional on unknown parameters (a sampling
distribution, defining a likelihood), together with a distribution for the unknown
parameters (a prior distribution). It is the combination of prior and likelihood
which defines the overall model. In terms of the mixture representation, the
specification of a prior distribution for unknown parameters is therefore an essential
and unavoidable part of the process of representing beliefs about observables and
hence of learning from experience.


From the operational, subjectivist perspective, it is meaningless to approach 
modelling solely in terms of the parametric component and ignoring the prior 
distribution. We are, therefore, in fundamental disagreement with approaches to 
statistical modelling and analysis which proceed only on the basis of the sampling 
distribution or likelihood and treat the prior distribution as something optional, 
irrelevant, or even subversive (see Appendix B). 


That said, it should be readily acknowledged that the process of representing 
prior beliefs itself involves a number of both conceptual and practical difficulties, 
and certainly cannot be summarily dealt with in a superficial or glib manner. 

From a conceptual point of view, as we have repeatedly stressed throughout this
chapter, prior beliefs about parameters typically acquire an operational significance 
and interpretation as beliefs about limiting (large-sample) functions of observables. 
Care must therefore obviously be taken to ensure that prior specifications respect 
logical or other constraints pertaining to such limits. Often, the specification process 
will be facilitated by suitable “reparametrisation”. 

From a practical point of view, detailed treatment of specific cases is very 
much a matter of “methods” rather than “theory” and will be dealt with in the third 
volume of this series. However, a general overview of representation strategies, 
together with a number of illustrative examples, will be given in the inference 
context in Chapter 5. In particular, we shall see that the range of creative possibilities 
opened up by the consideration of mixtures, asymptotics, robustness and sensitivity 
analysis, as well as novel and flexible forms of inference reporting, provides a rich 
and illuminating perspective and framework for inference, within which many of the 
apparent difficulties associated with the precise specification of prior distributions 
are seen to be of far less significance than is commonly asserted by critics of the 
Bayesian approach. 


4.8 DISCUSSION AND FURTHER REFERENCES 


4.8.1 Representation Theorems 


The original representation theorem for exchangeable 0-1 random quantities appears 
in de Finetti (1930), the concept of exchangeability having been considered 
earlier by Haag (1924) and also in the early 1930’s by Khintchine (1932). Exten- 
sions to the case of general exchangeable random quantities appear in de Finetti 
(1937/1964) and Dynkin (1953), with an abstract analytical version appearing in 
Hewitt and Savage (1955). Seminal extensions to more complex forms of symmetry 
(partial exchangeability) can be found in de Finetti (1938) and Freedman 
(1962). See Diaconis and Freedman (1980b) and Wechsler (1993) for overviews 
and generalisations of the concept of exchangeability. 

Recent and current developments have generated an extensive catalogue of 
characterisations of distributions via both invariance and sufficiency conditions. 
Important progress is made in Diaconis and Freedman (1984, 1987, 1990) and 
Küchler and Lauritzen (1989). See, also, Ressel (1985). Useful reviews are given by 
Aldous (1985), Diaconis (1988a) and, from a rather different perspective, Lauritzen 
(1982, 1988). The conference proceedings edited by Koch and Spizzichino (1982) 
also provides a wealth of related material and references. For related developments 
from a reliability perspective, see Barlow and Mendel (1992, 1994) and Mendel 
(1992). 


4.8.2 Subjectivity and Objectivity 


Our approach to modelling has been dictated by a subjectivist, operational concern 
with individual beliefs about (potential) observables. Through judgements of 
symmetry, partial symmetry, more complex invariance or sufficiency, we have seen 
how mixtures over conditionally independent “parameter-labelled” forms arise as 
typical representations of such beliefs. We have noted how this illuminates, and 
puts into perspective, linguistic separation into “likelihood” (or “sampling model”) 
and “prior” components. But we have also stressed that, from our standpoint, the 
two are actually inseparable in defining a belief model. 

In contrast, traditional discussion of a statistical model typically refers to the 
parametric form as “the model”. The latter then defines “objective” probabilities 
for outcomes defined in terms of observables, these probabilities being determined 
by the values of the “unknown parameters”. It is often implicit in such discussion 
that if the “true” parameter were known, the corresponding parametric form would 
be the “true” model for the observables. Clearly, such an approach seeks to make 
a very clear distinction between the nature of observables and parameters. It is as 
if, given the “true” parameter, the corresponding parametric distribution is seen as 
part of “objective reality”, providing the mechanism whereby the observables are 
generated. The “prior”, on the other hand, is seen as a “subjective” optional extra, a 
potential contaminant of the objective statements provided by the parametric model. 

Clearly, this view has little in common with the approach we have systematically 
followed in this volume. However, there is an interesting sense, even from 
our standpoint, in which the parametric model and the prior can be seen as having 
different roles. 

Instead of viewing these roles as corresponding to an objective/subjective 
dichotomy, we view them in terms of an intersubjective/subjective dichotomy (following 
Dawid, 1982b, 1986b). To this end, consider a group of Bayesians, all 
concerned with their belief distributions for the same sequence of observables. In 
the absence of any general agreement over assumptions of symmetry, invariance or 
sufficiency, the individuals are each simply left with their own subjective assessments. 
However, given some set of common assumptions, the results of this chapter 
imply that the entire group will structure their beliefs using some common form of 
mixture representation. Within the mixture, the parametric forms adopted will be 
the same (the intersubjective component), while the priors for the parameter will 
differ from individual to individual (the subjective component). Such intersubjec- 
tive agreement clearly facilitates communication within the group and reduces areas 
of potential disagreement to just that of different prior judgements for the parame- 
ter. As we shall see in Chapter 5, judgements about the parameter will tend more 
towards a consensus as more data are acquired, so that such a group of Bayesians 
may eventually come to share very similar beliefs, even if their initial judgements 
about the parameter were markedly different. We emphasise again, however, that 
the key element here is intersubjective agreement or consensus. We can find no 
real role for the idea of objectivity except, perhaps, as a possibly convenient, but 
potentially dangerously misleading, “shorthand” for intersubjective communality 
of beliefs. 


4.8.3 Critical Issues 


We conclude this chapter on modelling with some further comments concerning 
(i) The Role and Nature of Models, (ii) Structural and Stochastic Assumptions, (iii) 
Identifiability and (iv) Robustness Considerations. 


The Role and Nature of Models 


In the approach we have adopted, the fundamental notion of a model is that of a 
predictive probability specification for observables. However, the forms of repre- 
sentation theorems we have been discussing provide, in typical cases, a basis for 
separating out, if required, two components: the parametric model, and the belief 
model for the parameters. Indeed, we have drawn attention in Section 4.8.2 to 
the fact that shared structural belief assumptions among a group of individuals can 
imply the adoption of a common form of parametric model, while allowing the 
belief models for the parameters to vary from individual to individual. One might 
go further and argue that without some element of agreement of this kind there 
would be great difficulty in obtaining any meaningful form of scientific discussion 
or possible consensus. 

Non-subjectivist discussions of the role and nature of models in statistical 
analysis tend to have a rather different emphasis (see, for example, Cox, 1990, 
and Lehmann, 1990). However, such discussions often end up with a similar 
message, implicit or explicit, about the importance of models in providing a focused 
framework to serve as a basis for subsequent identification of areas of agreement and 
disagreement. In order to think about complex phenomena, one must necessarily 
work with simplified representations. In any given context, there are typically 
a number of different choices of degrees of simplification and idealisation that 
might be adopted and these different choices correspond to what Lehmann calls “a 
reservoir of models”, where 


... particular emphasis is placed on transparent characterisations or descriptions 
of the models that would facilitate the understanding of when a given model is 
appropriate. (Lehmann, 1990) 


But appropriate for what? Many authors—including Cox and Lehmann— 
highlight a distinction between what one might call scientific and technological 
approaches to models. The essence of the dichotomy is that scientists are assumed to 
seek explanatory models, which aim at providing insight into and understanding of 
the “true” mechanisms of the phenomenon under study, whereas technologists are 
content with empirical models, which are not concerned with the “truth”, but simply 
with providing a reliable basis for practical action in predicting and controlling 
phenomena of interest. 

Put very crudely, in terms of our generic notation, explanatory modellers 
take the form of $p(\boldsymbol{x} \mid \boldsymbol{\theta})$ very seriously, whereas empirical modellers are simply 
concerned that $p(\boldsymbol{x})$ “works”. For an elaboration of the latter view, see Leonard 
(1980). 

The approach we have adopted is compatible with either emphasis. As we 
have stressed many times, it is observables which provide the touchstone of ex- 
perience. When comparing rival belief specifications, all other things being equal 
we are intuitively more impressed with the one which consistently assigns higher 
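probabilities to the things that actually happen. If, in fact, a phenomenon is governed 
by the specific mechanism $p(\boldsymbol{x} \mid \boldsymbol{\theta})$ with $\boldsymbol{\theta} = \boldsymbol{\theta}_0$, a scientist who discovers 
this and sets $p(\boldsymbol{x}) = p(\boldsymbol{x} \mid \boldsymbol{\theta}_0)$ will certainly have a $p(\boldsymbol{x})$ that “works”. 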

However, we are personally rather sceptical about taking the science versus 
technology distinction too seriously. Whilst we would not dispute that there are 
typically real differences in motivation and rhetoric between scientists and technol- 
ogists, it seems to us that theories are always ultimately judged by the predictive 
power they provide. Is there really a meaningful concept of “truth” in this context 
other than a pragmatic one predicated on $p(\boldsymbol{x})$? We shall return to this issue in 
Chapter 6, but our prejudices are well-captured in the adage: “all models are false, 
but some are useful”. 


Structural and Stochastic Assumptions 


In Section 4.6, we considered several illustrative examples where, separate from 
considerations about the complete form of probability specification to be adopted, 
the key role of the parametric model component $p(\boldsymbol{x} \mid \boldsymbol{\theta})$ was to specify structured 
forms of expectations for the observables conditional on the parameters. We recall 
two examples. 




In the case of observables $x_{ijk}$ in a two-way layout with replications (Section 4.6.3), 
with parameters corresponding to overall mean, main effects and interactions, 
we encountered the form 

$$E(x_{ijk}) = \mu + \alpha_i + \beta_j + \gamma_{ij};$$

in the case of a vector of observables $\boldsymbol{x}$ in a multiple regression context with design 
matrix $\boldsymbol{A}$ (Section 4.6.4, Example 4.14), we encountered the form 

$$E(\boldsymbol{x}) = \boldsymbol{A}\boldsymbol{\theta}.$$


In both of these cases, fundamental explanatory or predictive structure is cap- 
tured by the specification of the conditional expectation, and this aspect can in many 
cases be thought through separately from the choice of a particular specification of 
full probability distribution. 


Identifiability 

A parametric model for which an element of the parametrisation is redundant is said 
to be non-identified. Such models are often introduced at an early stage of model 
building (particularly in econometrics) in order to include all parameters which may 
originally be thought to be relevant. Identifiability is a property of the parametric 
model, but a Bayesian analysis of a non-identified model is always possible if a 
proper prior on all the parameters is specified. For detailed discussion of this issue, 
see Morales (1971), Drèze (1974), Kadane (1974), Florens and Mouchart (1986), 
Hills (1987) and Florens et al. (1990, Section 4.5). 


Robustness Considerations 


For concreteness, in our earlier discussion of these examples we assumed that the 
$p(\boldsymbol{x} \mid \boldsymbol{\theta})$ terms were specified in terms of normal distributions. As we demonstrated 
earlier in this chapter, under the a priori assumption of appropriate invariances, or 
on the basis of experience with particular applications, such a specification may 
well be natural and acceptable. However, in many situations the choice of a specific 
probability distribution may feel a much less “secure” component of the overall 
modelling process than the choice of conditional expectation structure. 

For example, past experience might suggest that departures of observables 
from assumed expectations resemble a symmetric bell-shaped distribution cen- 
tred around zero. But a number of families of distributions match these general 
characteristics, including the normal, Student and logistic families. Faced with a 
seemingly arbitrary choice, what can be done in a situation like this to obtain further 
insight and guidance? Does the choice matter? Or are subsequent inferences or 
predictions robust against such choices? 

An exactly analogous problem arises with the choice of mathematical specifications 
for the prior model component. 

In robustness considerations, theoretical analysis, sometimes referred to as 
“what if?” analysis, has an interesting role to play. Using the inference machinery 
which we shall develop in Chapter 5, the desired insight and guidance can often 
be obtained by studying mathematically the ways in which the various “arbitrary” 
choices affect subsequent forms of inferences and predictions. For example, a “what 
if?” analysis might consider the effect of a single, aberrant, outlying observation on 
inferences for main effects in a multiway layout under the alternative assumptions of 
a normal or Student parametric model distribution. It can be shown that the influence 
of the aberrant observation is large under the normal assumption, but negligible 
under the Student assumption, thus providing a potential basis for preferring one 
or other of the otherwise seemingly arbitrary choices. 
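
A “what if?” comparison of this kind can be sketched numerically. The following is only an illustration under simplifying assumptions that are not in the text: a single location parameter instead of a multiway layout, unit scale, a flat prior on a bounded grid, a Student distribution with 3 degrees of freedom, and invented data values.

```python
import math

def log_normal(z):
    # Log density of a standard normal, up to an additive constant.
    return -0.5 * z * z

def log_student(z, nu=3.0):
    # Log density of a standard Student t with nu degrees of freedom, up to a constant.
    return -0.5 * (nu + 1.0) * math.log1p(z * z / nu)

def posterior_mean(data, log_density, lo=-20.0, hi=40.0, steps=6001):
    """Posterior mean of a location mu under a flat prior on [lo, hi] (grid approximation)."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    log_post = [sum(log_density(x - mu) for x in data) for mu in grid]
    top = max(log_post)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(lp - top) for lp in log_post]
    return sum(mu * w for mu, w in zip(grid, weights)) / sum(weights)

clean = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.1]   # hypothetical, well-behaved observations
contaminated = clean + [30.0]              # the same data plus one aberrant value

shift_normal = posterior_mean(contaminated, log_normal) - posterior_mean(clean, log_normal)
shift_student = posterior_mean(contaminated, log_student) - posterior_mean(clean, log_student)
print(f"shift under normal model:  {shift_normal:.3f}")
print(f"shift under Student model: {shift_student:.3f}")
```

Under the normal assumption the posterior mean follows the sample mean and is dragged several units towards the outlier, whereas under the Student assumption it barely moves.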

More detailed analysis of such robustness issues will be given in Section 5.6.3. 


Chapter 5 


Inference 


Summary 


The role of Bayes’ theorem in the updating of beliefs about observables in the 
light of new information is identified and related to conventional mechanisms 
of predictive and parametric inference. The roles of sufficiency, ancillarity and 
stopping rules in such inference processes are also examined. Forms of common 
statistical decisions and inference summaries are introduced and the problems of 
implementing Bayesian procedures are discussed at length. In particular, conju- 
gate, asymptotic and reference forms of analysis and numerical approximation 
approaches are detailed. 


5.1 THE BAYESIAN PARADIGM 


5.1.1 Observables, Beliefs and Models 


Our development has focused on the foundational issues which arise when we aspire 
to formal quantitative coherence in the context of decision making in situations 
of uncertainty. This development, in combination with an operational approach 
to the basic concepts, has led us to view the problem of statistical modelling as 
that of identifying or selecting particular forms of representation of beliefs about 
observables. 

For example, in the case of a sequence $x_1, x_2, \ldots$ of 0-1 random quantities 
for which beliefs correspond to a judgement of infinite exchangeability, Proposition 4.1 
(de Finetti’s theorem) identifies the representation of the joint mass 
function for $x_1, \ldots, x_n$ as having the form 

$$p(x_1, \ldots, x_n) = \int_0^1 \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} \, dQ(\theta),$$

for some choice of distribution $Q$ over the interval $[0, 1]$. 
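
As a concrete check on this representation, the sketch below evaluates the mixture for one particular choice of $Q$. The Beta(2, 3) form of $Q$ is purely an assumption made for illustration, chosen because the integral then has a closed form; the function names are hypothetical.

```python
import math
from itertools import product

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def joint_mass(xs, a=2.0, b=3.0):
    """Closed form of the mixture integral when Q is Beta(a, b)."""
    n, r = len(xs), sum(xs)
    return math.exp(log_beta(a + r, b + n - r) - log_beta(a, b))

def joint_mass_numeric(xs, a=2.0, b=3.0, steps=20000):
    """The same integral by the midpoint rule, as an independent check."""
    n, r = len(xs), sum(xs)
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        q_density = math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_beta(a, b))
        total += t ** r * (1 - t) ** (n - r) * q_density
    return total * h

# Exchangeability: the mass depends on the sequence only through n and the number of 1's.
print(joint_mass((1, 0, 0)), joint_mass((0, 0, 1)))
# Coherence: the masses of all 0-1 sequences of length 4 add up to one.
print(sum(joint_mass(xs) for xs in product((0, 1), repeat=4)))
# Dependence: p(1, 1) differs from p(1) * p(1), so the x_i are not independent.
print(joint_mass((1, 1)), joint_mass((1,)) ** 2)
```

The last line illustrates that exchangeable quantities are conditionally independent given $\theta$ but not marginally independent.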

More generally, for sequences of real-valued or integer-valued random quantities, 
$x_1, x_2, \ldots$, we have seen, in Sections 4.3 to 4.5, that beliefs which combine 
judgements of exchangeability with some form of further structure (either in terms 
of invariance or sufficient statistics), often lead us to work with representations of 
the form 

$$p(x_1, \ldots, x_n) = \int_{\mathbb{R}^k} \prod_{i=1}^{n} p(x_i \mid \boldsymbol{\theta}) \, dQ(\boldsymbol{\theta}),$$

where $p(x \mid \boldsymbol{\theta})$ denotes a specified form of labelled family of probability distributions 
and $Q$ is some choice of distribution over $\mathbb{R}^k$. 

Such representations, and the more complicated forms considered in Sec- 
tion 4.6, exhibit the various ways in which the element of primary significance 
from the subjectivist, operationalist standpoint, namely the predictive model of 
beliefs about observables, can be thought of as if constructed from a parametric 
model together with a prior distribution for the labelling parameter. 

Our primary concern in this chapter will be with the way in which the updating 
of beliefs in the light of new information takes place within the framework of such 
representations. 


5.1.2 The Role of Bayes’ Theorem 


In its simplest form, within the formal framework of predictive model belief distributions 
derived from quantitative coherence considerations, the problem corresponds 
to identifying the joint conditional density of 

$$p(x_{n+1}, \ldots, x_{n+m} \mid x_1, \ldots, x_n)$$

for any $m \geq 1$, given, for any $n \geq 1$, the form of representation of the joint density 

$$p(x_1, \ldots, x_n).$$

In general, of course, this simply reduces to calculating 

$$p(x_{n+1}, \ldots, x_{n+m} \mid x_1, \ldots, x_n) = \frac{p(x_1, \ldots, x_{n+m})}{p(x_1, \ldots, x_n)}$$

and, in the absence of further structure, there is little more that can be said. However, 
when the predictive model admits a representation in terms of parametric 
models and prior distributions, the learning process can be essentially identified, in 
conventional terminology, with the standard parametric form of Bayes’ theorem. 

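Thus, for example, if we consider the general parametric form of representation 
for an exchangeable sequence, with $dQ(\boldsymbol{\theta})$ having density representation, $p(\boldsymbol{\theta})\,d\boldsymbol{\theta}$, 
we have 

$$p(x_1, \ldots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta},$$

from which it follows that 

$$
\begin{aligned}
p(x_{n+1}, \ldots, x_{n+m} \mid x_1, \ldots, x_n)
&= \frac{\int \prod_{i=1}^{n+m} p(x_i \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}}{\int \prod_{i=1}^{n} p(x_i \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}} \\
&= \int \prod_{i=n+1}^{n+m} p(x_i \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid x_1, \ldots, x_n) \, d\boldsymbol{\theta},
\end{aligned}
$$

where 

$$p(\boldsymbol{\theta} \mid x_1, \ldots, x_n) = \frac{\prod_{i=1}^{n} p(x_i \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{\int \prod_{i=1}^{n} p(x_i \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}}.$$

This latter relationship is just Bayes’ theorem, expressing the posterior density 
for $\boldsymbol{\theta}$, given $x_1, \ldots, x_n$, in terms of the parametric model for $x_1, \ldots, x_n$ given $\boldsymbol{\theta}$, 
and the prior density for $\boldsymbol{\theta}$. The (conditional, or posterior) predictive model for 
$x_{n+1}, \ldots, x_{n+m}$, given $x_1, \ldots, x_n$, is seen to have precisely the same general form 
of representation as the initial predictive model, except that the corresponding parametric 
model component is now integrated with respect to the posterior distribution 
of the parameter, rather than with respect to the prior distribution. 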
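
The equivalence of the two routes to the predictive density (the ratio of joint densities, and integration against the posterior) can be checked in a simple conjugate case. The sketch below assumes 0-1 observables and a Beta(a, b) prior density, neither of which is required by the general argument; the function names are illustrative only.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def joint(xs, a, b):
    """p(x_1, ..., x_n) for exchangeable 0-1 data when the mixing density is Beta(a, b)."""
    n, r = len(xs), sum(xs)
    return math.exp(log_beta(a + r, b + n - r) - log_beta(a, b))

def predictive_by_ratio(future, observed, a=1.0, b=1.0):
    # p(future | observed) = p(observed, future) / p(observed)
    return joint(tuple(observed) + tuple(future), a, b) / joint(observed, a, b)

def predictive_by_posterior(future, observed, a=1.0, b=1.0):
    # Bayes' theorem turns the Beta(a, b) prior into a Beta(a + r, b + n - r) posterior;
    # the future data are then integrated against that posterior.
    n, r = len(observed), sum(observed)
    return joint(future, a + r, b + n - r)

observed = (1, 1, 0, 1)
future = (1, 0)
print(predictive_by_ratio(future, observed), predictive_by_posterior(future, observed))
```

With a uniform prior and a single future observation, both routes reduce to the familiar $(r + 1)/(n + 2)$.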

We recall from Chapter 4 that, considered as a function of $\boldsymbol{\theta}$, 

$$\mathrm{lik}(\boldsymbol{\theta} \mid x_1, \ldots, x_n) = p(x_1, \ldots, x_n \mid \boldsymbol{\theta})$$

is usually referred to as the likelihood function. A formal definition of such a 
concept is, however, problematic; for details, see Bayarri et al. (1988) and Bayarri 
and DeGroot (1992b). 


5.1.3 Predictive and Parametric Inference 


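Given our operationalist concern with modelling and reporting uncertainty in terms 
of observables, it is not surprising that Bayes’ theorem, in its role as the key to 
a coherent learning process for parameters, simply appears as a step within the 
predictive process of passing from 

$$p(x_1, \ldots, x_n) = \int p(x_1, \ldots, x_n \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}$$

to 

$$p(x_{n+1}, \ldots, x_{n+m} \mid x_1, \ldots, x_n) = \int p(x_{n+1}, \ldots, x_{n+m} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid x_1, \ldots, x_n) \, d\boldsymbol{\theta},$$

by means of 

$$p(\boldsymbol{\theta} \mid x_1, \ldots, x_n) = \frac{p(x_1, \ldots, x_n \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{\int p(x_1, \ldots, x_n \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}}.$$

Writing $\boldsymbol{y} = \{y_1, \ldots, y_m\} = \{x_{n+1}, \ldots, x_{n+m}\}$ to denote future (or, as 
yet unobserved) quantities and $\boldsymbol{x} = \{x_1, \ldots, x_n\}$ to denote the already observed 
quantities, these relations may be re-expressed more simply as 

$$p(\boldsymbol{x}) = \int p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta},$$

$$p(\boldsymbol{y} \mid \boldsymbol{x}) = \int p(\boldsymbol{y} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \boldsymbol{x}) \, d\boldsymbol{\theta}$$

and 

$$p(\boldsymbol{\theta} \mid \boldsymbol{x}) = p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) / p(\boldsymbol{x}).$$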


However, as we noted on many occasions in Chapter 4, if we proceed purely 
formally, from an operationalist standpoint it is not at all clear, at first sight, how we 
should interpret “beliefs about parameters”, as represented by $p(\boldsymbol{\theta})$ and $p(\boldsymbol{\theta} \mid \boldsymbol{x})$, 
or even whether such “beliefs” have any intrinsic interest. We also answered these 
questions on many occasions in Chapter 4, by noting that, in all the forms of 
predictive model representations we considered, the parameters had interpretations 
as strong law limits of (appropriate functions of) observables. Thus, for example, 
in the case of the infinitely exchangeable 0-1 sequence (Section 4.3.1) beliefs 
about $\theta$ correspond to beliefs about what the long-run frequency of 1’s would be 
in a future sample; in the context of a real-valued exchangeable sequence with 
centred spherical symmetry (Section 4.4.1), beliefs about $\mu$ and $\sigma^2$, respectively, 
correspond to beliefs about what the large sample mean, and the large sample mean 
sum of squares about the sample mean would be, in a future sample. 

Inference about parameters is thus seen to be a limiting form of predictive 
inference about observables. This means that, although the predictive form is 
primary, and the role of parametric inference is typically that of an intermediate 
structural step, parametric inference will often itself be the legitimate end-product 
of a statistical analysis in situations where interest focuses on quantities which 
could be viewed as large-sample functions of observables. Either way, parametric 
inference is of considerable importance for statistical analysis in the context of the 
models we are mainly concerned with in this volume. 

When a parametric form is involved simply as an intermediate step in the 
predictive process, we have seen that $p(\boldsymbol{\theta} \mid x_1, \ldots, x_n)$, the full joint posterior 
density for the parameter vector $\boldsymbol{\theta}$, is all that is required. However, if we are 
concerned with parametric inference per se, we may be interested in only some 
subset, $\boldsymbol{\phi}$, of the components of $\boldsymbol{\theta}$, or in some transformed subvector of parameters, 
$g(\boldsymbol{\theta})$. For example, in the case of a real-valued sequence we may only be interested 
in the large-sample mean and not in the variance; or in the case of two 0-1 
sequences we may only be interested in the difference in the long-run frequencies. 

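In the case of interest in a subvector of $\boldsymbol{\theta}$, let us suppose that the full parameter 
vector can be partitioned into $\boldsymbol{\theta} = \{\boldsymbol{\phi}, \boldsymbol{\lambda}\}$, where $\boldsymbol{\phi}$ is the subvector of interest, 
and $\boldsymbol{\lambda}$ is the complementary subvector of $\boldsymbol{\theta}$, often referred to, in this context, as the 
vector of nuisance parameters. Since 

$$p(\boldsymbol{\theta} \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(\boldsymbol{x})},$$

the (marginal) posterior density for $\boldsymbol{\phi}$ is given by 

$$p(\boldsymbol{\phi} \mid \boldsymbol{x}) = \int p(\boldsymbol{\theta} \mid \boldsymbol{x}) \, d\boldsymbol{\lambda} = \int p(\boldsymbol{\phi}, \boldsymbol{\lambda} \mid \boldsymbol{x}) \, d\boldsymbol{\lambda},$$

where 

$$p(\boldsymbol{x}) = \int p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta} = \int\!\!\int p(\boldsymbol{x} \mid \boldsymbol{\phi}, \boldsymbol{\lambda}) \, p(\boldsymbol{\phi}, \boldsymbol{\lambda}) \, d\boldsymbol{\phi} \, d\boldsymbol{\lambda},$$

with all integrals taken over the full range of possible values of the relevant quantities. 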
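Expressed in terms of the notation introduced in Section 3.2.4, we have 

$$p(\boldsymbol{x} \mid \boldsymbol{\phi}, \boldsymbol{\lambda}) \otimes p(\boldsymbol{\phi}, \boldsymbol{\lambda}) \equiv p(\boldsymbol{\phi}, \boldsymbol{\lambda} \mid \boldsymbol{x}),$$

$$p(\boldsymbol{\phi}, \boldsymbol{\lambda} \mid \boldsymbol{x}) \longrightarrow p(\boldsymbol{\phi} \mid \boldsymbol{x}).$$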
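
The marginalisation over a nuisance parameter can be illustrated on a grid. In the sketch below the parameter of interest is a normal mean and the nuisance parameter is the standard deviation; the data, the grid limits and the prior proportional to $1/\lambda$ are all assumptions made only for this illustration.

```python
import math

data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.9]            # hypothetical observations
n = len(data)
xbar = sum(data) / n
ss = sum((x - xbar) ** 2 for x in data)

phi_grid = [2.0 + 6.0 * i / 200 for i in range(201)]            # parameter of interest
lam_grid = [0.1 + 4.9 * (j + 0.5) / 200 for j in range(200)]    # nuisance parameter
d_phi = 6.0 / 200
d_lam = 4.9 / 200

def log_joint(phi, lam):
    # log p(x | phi, lam) + log p(phi, lam), up to a constant, with p(phi, lam) ∝ 1 / lam
    return -(n + 1) * math.log(lam) - (ss + n * (xbar - phi) ** 2) / (2.0 * lam * lam)

# Unnormalised joint posterior p(x | phi, lam) p(phi, lam) on the grid.
table = [[math.exp(log_joint(p, l)) for l in lam_grid] for p in phi_grid]

# Normalising constant, then p(phi | x) obtained by summing out the nuisance parameter.
p_x = sum(sum(row) for row in table) * d_phi * d_lam
marginal_phi = [sum(row) * d_lam / p_x for row in table]

post_mean = sum(p * m for p, m in zip(phi_grid, marginal_phi)) * d_phi
print(f"marginal posterior mean of phi: {post_mean:.4f}")
```

The nuisance parameter never needs to be estimated; it is simply integrated (here, summed) out of the joint posterior.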
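In some situations, the prior specification $p(\boldsymbol{\phi}, \boldsymbol{\lambda})$ may be most easily arrived 
at through the specification of $p(\boldsymbol{\lambda} \mid \boldsymbol{\phi}) \, p(\boldsymbol{\phi})$. In such cases, we note that we could 
first calculate the integrated likelihood for $\boldsymbol{\phi}$, 

$$p(\boldsymbol{x} \mid \boldsymbol{\phi}) = \int p(\boldsymbol{x} \mid \boldsymbol{\phi}, \boldsymbol{\lambda}) \, p(\boldsymbol{\lambda} \mid \boldsymbol{\phi}) \, d\boldsymbol{\lambda},$$

and subsequently proceed without any further need to consider the nuisance parameters, 
since 

$$p(\boldsymbol{\phi} \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x} \mid \boldsymbol{\phi}) \, p(\boldsymbol{\phi})}{p(\boldsymbol{x})}.$$

In the case where interest is focused on a transformed parameter vector, $g(\boldsymbol{\theta})$, 
we proceed using standard change-of-variable probability techniques as described 
in Section 3.2.4. Suppose first that $\boldsymbol{\psi} = g(\boldsymbol{\theta})$ is a one-to-one differentiable transformation 
of $\boldsymbol{\theta}$. It then follows that 

$$p_{\psi}(\boldsymbol{\psi} \mid \boldsymbol{x}) = p_{\theta}(g^{-1}(\boldsymbol{\psi}) \mid \boldsymbol{x}) \, | J_{g^{-1}}(\boldsymbol{\psi}) |,$$

where 

$$J_{g^{-1}}(\boldsymbol{\psi}) = \frac{\partial g^{-1}(\boldsymbol{\psi})}{\partial \boldsymbol{\psi}}$$

is the Jacobian of the inverse transformation $\boldsymbol{\theta} = g^{-1}(\boldsymbol{\psi})$. Alternatively, by substituting 
$\boldsymbol{\theta} = g^{-1}(\boldsymbol{\psi})$, we could write $p(\boldsymbol{x} \mid \boldsymbol{\theta})$ as $p(\boldsymbol{x} \mid \boldsymbol{\psi})$, and replace $p(\boldsymbol{\theta})$ by 
$p_{\theta}(g^{-1}(\boldsymbol{\psi})) \, | J_{g^{-1}}(\boldsymbol{\psi}) |$, to obtain $p(\boldsymbol{\psi} \mid \boldsymbol{x}) = p(\boldsymbol{x} \mid \boldsymbol{\psi}) \, p(\boldsymbol{\psi}) / p(\boldsymbol{x})$ directly. 

If $\boldsymbol{\psi} = g(\boldsymbol{\theta})$ has dimension less than $\boldsymbol{\theta}$, we can typically define $\boldsymbol{\gamma} = (\boldsymbol{\psi}, \boldsymbol{\omega}) = 
h(\boldsymbol{\theta})$, for some $\boldsymbol{\omega}$ such that $\boldsymbol{\gamma} = h(\boldsymbol{\theta})$ is a one-to-one differentiable transformation, 
and then proceed in two steps. We first obtain 

$$p(\boldsymbol{\psi}, \boldsymbol{\omega} \mid \boldsymbol{x}) = p_{\theta}(h^{-1}(\boldsymbol{\gamma}) \mid \boldsymbol{x}) \, | J_{h^{-1}}(\boldsymbol{\gamma}) |,$$

where 

$$J_{h^{-1}}(\boldsymbol{\gamma}) = \frac{\partial h^{-1}(\boldsymbol{\gamma})}{\partial \boldsymbol{\gamma}},$$

and then marginalise to 

$$p(\boldsymbol{\psi} \mid \boldsymbol{x}) = \int p(\boldsymbol{\psi}, \boldsymbol{\omega} \mid \boldsymbol{x}) \, d\boldsymbol{\omega}.$$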

These techniques will be used extensively in later sections of this chapter. 
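
The change-of-variable step for a one-to-one transformation can be checked numerically in one dimension. The sketch below assumes, purely for illustration, that the posterior for $\theta$ is Beta(3, 5) and that interest lies in the log-odds $\psi = \log\{\theta/(1-\theta)\}$; the helper names are hypothetical.

```python
import math

def beta_pdf(theta, a, b):
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

def inverse_g(psi):
    # theta = g^{-1}(psi) for psi = g(theta) = log(theta / (1 - theta))
    return 1.0 / (1.0 + math.exp(-psi))

def psi_pdf(psi, a, b):
    # p_psi(psi | x) = p_theta(g^{-1}(psi) | x) * |J|, with J = d theta / d psi = theta (1 - theta)
    theta = inverse_g(psi)
    jacobian = theta * (1.0 - theta)
    return beta_pdf(theta, a, b) * abs(jacobian)

def integrate(f, lo, hi, steps=40000):
    # Midpoint rule on [lo, hi].
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

total = integrate(lambda s: psi_pdf(s, 3.0, 5.0), -25.0, 25.0)
below_zero = integrate(lambda s: psi_pdf(s, 3.0, 5.0), -25.0, 0.0)
print(total, below_zero)
```

The transformed density integrates to one, and the probability that $\psi \leq 0$ agrees with the probability that $\theta \leq 1/2$ under the original Beta(3, 5) density, which is $99/128$.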

In order to keep the presentation of these basic manipulative techniques as 
simple as possible, we have avoided introducing additional notation for the ranges 
of possible values of the various parameters. In particular, all integrals have been 
assumed to be over the full ranges of the possible parameter values. 

In general, this notational economy will cause no confusion and the parameter 
ranges will be clear from the context. However, there are situations where specific 
constraints on parameters are introduced and need to be made explicit in the analysis. 
In such cases, notation for ranges of parameter values will typically also need to be 
made explicit. 

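Consider, for example, a parametric model, $p(\boldsymbol{x} \mid \boldsymbol{\theta})$, together with a prior 
specification $p(\boldsymbol{\theta})$, $\boldsymbol{\theta} \in \Theta$, for which the posterior density, suppressing explicit use 
of $\Theta$, is given by 

$$p(\boldsymbol{\theta} \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{\int p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}}.$$

Now suppose that it is required to specify the posterior subject to the constraint 
$\boldsymbol{\theta} \in \Theta_0 \subset \Theta$, where $\int_{\Theta_0} p(\boldsymbol{\theta}) \, d\boldsymbol{\theta} > 0$. 
Defining the constrained prior density by 

$$p_0(\boldsymbol{\theta}) = \frac{p(\boldsymbol{\theta})}{\int_{\Theta_0} p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}}, \quad \boldsymbol{\theta} \in \Theta_0,$$

we obtain, using Bayes’ theorem, 

$$p(\boldsymbol{\theta} \mid \boldsymbol{x}, \boldsymbol{\theta} \in \Theta_0) = \frac{p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p_0(\boldsymbol{\theta})}{\int_{\Theta_0} p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p_0(\boldsymbol{\theta}) \, d\boldsymbol{\theta}}, \quad \boldsymbol{\theta} \in \Theta_0.$$

From this, substituting for $p_0(\boldsymbol{\theta})$ in terms of $p(\boldsymbol{\theta})$ and dividing both numerator and 
denominator by 

$$p(\boldsymbol{x}) = \int_{\Theta} p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta},$$

we obtain 

$$p(\boldsymbol{\theta} \mid \boldsymbol{x}, \boldsymbol{\theta} \in \Theta_0) = \frac{p(\boldsymbol{\theta} \mid \boldsymbol{x})}{\int_{\Theta_0} p(\boldsymbol{\theta} \mid \boldsymbol{x}) \, d\boldsymbol{\theta}}, \quad \boldsymbol{\theta} \in \Theta_0,$$

expressing the constraint in terms of the unconstrained posterior (a result which 
could, of course, have been obtained by direct, straightforward conditioning). 

Numerical methods are often necessary to analyze models with constrained 
parameters; see Gelfand et al. (1992) for the use of Gibbs sampling in this context. 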
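
The agreement between the two routes to the constrained posterior can be verified on a grid. The following sketch assumes a Bernoulli likelihood, a Beta(2, 2) prior and the constraint set $[0.5, 1]$; these choices, and the function name, are illustrative and do not come from the text.

```python
def constrained_posteriors(r, n, lo, hi, steps=2000):
    h = 1.0 / steps
    grid = [(k + 0.5) * h for k in range(steps)]
    prior = [6.0 * t * (1.0 - t) for t in grid]             # illustrative Beta(2, 2) prior
    lik = [t ** r * (1.0 - t) ** (n - r) for t in grid]     # Bernoulli likelihood
    inside = [lo <= t <= hi for t in grid]                  # indicator of the constraint set

    # Route 1: constrain and renormalise the prior, then apply Bayes' theorem.
    prior_mass = h * sum(p for p, ok in zip(prior, inside) if ok)
    prior0 = [p / prior_mass if ok else 0.0 for p, ok in zip(prior, inside)]
    z0 = h * sum(l * p for l, p in zip(lik, prior0))
    route1 = [l * p / z0 for l, p in zip(lik, prior0)]

    # Route 2: form the unconstrained posterior, then renormalise it over the constraint set.
    z = h * sum(l * p for l, p in zip(lik, prior))
    post = [l * p / z for l, p in zip(lik, prior)]
    post_mass = h * sum(q for q, ok in zip(post, inside) if ok)
    route2 = [q / post_mass if ok else 0.0 for q, ok in zip(post, inside)]
    return grid, route1, route2

grid, route1, route2 = constrained_posteriors(r=3, n=10, lo=0.5, hi=1.0)
print(max(abs(u - v) for u, v in zip(route1, route2)))
```

In practice the second route is usually the more convenient: the unconstrained analysis is carried out once and the constraint is imposed afterwards by truncation and renormalisation.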


5.1.4 Sufficiency, Ancillarity and Stopping Rules 


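The concepts of predictive and parametric sufficient statistics were introduced in 
Section 4.5.2, and shown to be equivalent, within the framework of the kinds of 
models we are considering in this volume. In particular, it was established that 
a (minimal) sufficient statistic, $t(\boldsymbol{x})$, for $\boldsymbol{\theta}$, in the context of a parametric model 
$p(\boldsymbol{x} \mid \boldsymbol{\theta})$, can be characterised by either of the conditions 

$$p(\boldsymbol{\theta} \mid \boldsymbol{x}) = p(\boldsymbol{\theta} \mid t(\boldsymbol{x})), \quad \text{for all } p(\boldsymbol{\theta}),$$

or 

$$p(\boldsymbol{x} \mid t(\boldsymbol{x}), \boldsymbol{\theta}) = p(\boldsymbol{x} \mid t(\boldsymbol{x})).$$

The important implication of the concept is that $t(\boldsymbol{x})$ serves as a sufficient summary 
of the complete data $\boldsymbol{x}$ in forming any required revision of beliefs. The resulting data 
reduction often implies considerable simplification in modelling and analysis. In 
many cases, the sufficient statistic $t(\boldsymbol{x})$ can itself be partitioned into two component 
statistics, $t(\boldsymbol{x}) = [a(\boldsymbol{x}), s(\boldsymbol{x})]$ such that, for all $\boldsymbol{\theta}$, 

$$
\begin{aligned}
p(t(\boldsymbol{x}) \mid \boldsymbol{\theta}) &= p(s(\boldsymbol{x}) \mid a(\boldsymbol{x}), \boldsymbol{\theta}) \, p(a(\boldsymbol{x}) \mid \boldsymbol{\theta}) \\
&= p(s(\boldsymbol{x}) \mid a(\boldsymbol{x}), \boldsymbol{\theta}) \, p(a(\boldsymbol{x})).
\end{aligned}
$$

It then follows that, for any choice of $p(\boldsymbol{\theta})$, 

$$
\begin{aligned}
p(\boldsymbol{\theta} \mid \boldsymbol{x}) = p(\boldsymbol{\theta} \mid t(\boldsymbol{x})) &\propto p(t(\boldsymbol{x}) \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \\
&\propto p(s(\boldsymbol{x}) \mid a(\boldsymbol{x}), \boldsymbol{\theta}) \, p(\boldsymbol{\theta}),
\end{aligned}
$$

so that, in the prior to posterior inference process defined by Bayes’ theorem, it 
suffices to use $p(s(\boldsymbol{x}) \mid a(\boldsymbol{x}), \boldsymbol{\theta})$, rather than $p(t(\boldsymbol{x}) \mid \boldsymbol{\theta})$ as the likelihood function. 
This further simplification motivates the following definition. 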


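Definition 5.1. (Ancillary statistic). A statistic, $a(\boldsymbol{x})$, is said to be ancillary, 
with respect to $\boldsymbol{\theta}$ in a parametric model $p(\boldsymbol{x} \mid \boldsymbol{\theta})$, if $p(a(\boldsymbol{x}) \mid \boldsymbol{\theta}) = p(a(\boldsymbol{x}))$ for 
all values of $\boldsymbol{\theta}$. 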


Example 5.1. (Bernoulli model). In Example 4.5, we saw that for the Bernoulli 
parametric model 

$$p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta) = \theta^{r_n} (1 - \theta)^{n - r_n},$$

which only depends on $n$ and $r_n = x_1 + \cdots + x_n$. Thus, $t_n = [n, r_n]$ provides a minimal 
sufficient statistic, and one may work in terms of the joint probability function $p(n, r_n \mid \theta)$. 
If we now write 

$$p(n, r_n \mid \theta) = p(r_n \mid n, \theta) \, p(n \mid \theta),$$

and make the assumption that, for all $n \geq 1$, the mechanism by which the sample size, $n$, is 
arrived at does not depend on $\theta$, so that $p(n \mid \theta) = p(n)$, $n \geq 1$, we see that $n$ is ancillary 
for $\theta$, in the sense of Definition 5.1. It follows that prior to posterior inference for $\theta$ can 
therefore proceed on the basis of 

$$p(\theta \mid \boldsymbol{x}) = p(\theta \mid n, r_n) \propto p(r_n \mid n, \theta) \, p(\theta),$$

for any choice of $p(\theta)$, $0 \leq \theta \leq 1$. From Corollary 4.1, we see that 

$$
\begin{aligned}
p(r_n \mid n, \theta) &= \binom{n}{r_n} \theta^{r_n} (1 - \theta)^{n - r_n}, \quad 0 \leq r_n \leq n, \\
&= \mathrm{Bi}(r_n \mid \theta, n),
\end{aligned}
$$

so that inferences in this case can be made as if we had adopted a binomial parametric model. 
However, if we write 

$$p(n, r_n \mid \theta) = p(n \mid r_n, \theta) \, p(r_n \mid \theta)$$

and make the assumption that, for all $r_n \geq 1$, termination of sampling is governed by a 
mechanism for selecting $r_n$, which does not depend on $\theta$, so that $p(r_n \mid \theta) = p(r_n)$, $r_n \geq 1$, 
we see that $r_n$ is ancillary for $\theta$, in the sense of Definition 5.1. It follows that prior to posterior 
inference for $\theta$ can therefore proceed on the basis of 

$$p(\theta \mid \boldsymbol{x}) = p(\theta \mid n, r_n) \propto p(n \mid r_n, \theta) \, p(\theta),$$

for any choice of $p(\theta)$, $0 < \theta \leq 1$. It is easily verified that 

$$
\begin{aligned}
p(n \mid r_n, \theta) &= \binom{n - 1}{r_n - 1} \theta^{r_n} (1 - \theta)^{n - r_n}, \quad n \geq r_n, \\
&= \mathrm{Nb}(n \mid \theta, r_n)
\end{aligned}
$$

(see Section 3.2.2), so that inferences in this case can be made as if we had adopted a 
negative-binomial parametric model. 

We note, incidentally, that whereas in the binomial case it makes sense to consider 
$p(\theta)$ as specified over $0 \leq \theta \leq 1$, in the negative-binomial case it may only make sense to 
think of $p(\theta)$ as specified over $0 < \theta \leq 1$, since $p(r_n \mid \theta = 0) = 0$, for all $r_n \geq 1$. 

So far as prior to posterior inference for $\theta$ is concerned, we note that, for any specified 
$p(\theta)$, and assuming that either $p(n \mid \theta) = p(n)$ or $p(r_n \mid \theta) = p(r_n)$, we obtain 

$$p(\theta \mid x_1, \ldots, x_n) = p(\theta \mid n, r_n) \propto \theta^{r_n} (1 - \theta)^{n - r_n} \, p(\theta)$$

since, considered as functions of $\theta$, 

$$p(r_n \mid n, \theta) \propto p(n \mid r_n, \theta) \propto \theta^{r_n} (1 - \theta)^{n - r_n}.$$


The last part of the above example illustrates a general fact about the mechanism 
of parametric Bayesian inference which is trivially obvious; namely, for any 
specified $p(\boldsymbol{\theta})$, if the likelihood functions $p_1(\boldsymbol{x}_1 \mid \boldsymbol{\theta})$, $p_2(\boldsymbol{x}_2 \mid \boldsymbol{\theta})$ are proportional as 
functions of $\boldsymbol{\theta}$, the resulting posterior densities for $\boldsymbol{\theta}$ are identical. It turns out, 
as we shall see in Appendix B, that many non-Bayesian inference procedures do 
not lead to identical inferences when applied to such proportional likelihoods. The 
assertion that they should, the so-called Likelihood Principle, is therefore a controversial 
issue among statisticians. In contrast, in the Bayesian inference context 
described above, this is a straightforward consequence of Bayes’ theorem, rather 
than an imposed “principle”. Note, however, that the above remarks are predicated 
on a specified $p(\boldsymbol{\theta})$. It may be, of course, that knowledge of the particular sampling 
mechanism employed has implications for the specification of $p(\boldsymbol{\theta})$, as illustrated, 
for example, by the comment above concerning negative-binomial sampling and 
the restriction to $0 < \theta \leq 1$. 
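
The point about proportional likelihoods can be checked directly for the binomial and negative-binomial models of Example 5.1. The sketch below uses a grid approximation with an arbitrary illustrative prior and arbitrary values of $n$ and $r_n$; none of these choices come from the text.

```python
import math

def binomial_likelihood(theta, n, r):
    # Sample size n fixed in advance, r successes observed.
    return math.comb(n, r) * theta ** r * (1.0 - theta) ** (n - r)

def negative_binomial_likelihood(theta, n, r):
    # Number of successes r fixed in advance, sampling stopped at trial n.
    return math.comb(n - 1, r - 1) * theta ** r * (1.0 - theta) ** (n - r)

def grid_posterior(likelihood, prior, steps=1000):
    # Posterior density on a midpoint grid over (0, 1).
    h = 1.0 / steps
    grid = [(k + 0.5) * h for k in range(steps)]
    unnormalised = [likelihood(t) * prior(t) for t in grid]
    z = h * sum(unnormalised)
    return [u / z for u in unnormalised]

n, r = 12, 3
prior = lambda t: 2.0 * t     # any fixed prior will do; this one is only illustrative
post_bin = grid_posterior(lambda t: binomial_likelihood(t, n, r), prior)
post_nb = grid_posterior(lambda t: negative_binomial_likelihood(t, n, r), prior)
print(max(abs(u - v) for u, v in zip(post_bin, post_nb)))
```

The two likelihoods differ only by the constant factor $\binom{12}{3} / \binom{11}{2} = 4$, which cancels on normalisation, so the two posteriors coincide up to rounding error.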




Although the likelihood principle is implicit in Bayesian statistics, it was developed 
as a separate principle by Barnard (1949), and became a focus of interest 
when Birnbaum (1962) showed that it followed from the widely accepted sufficiency 
and conditionality principles. Berger and Wolpert (1984/1988) provide an 
extensive discussion of the likelihood principle and related issues. Other relevant 
references are Barnard et al. (1962), Fraser (1963), Pratt (1965), Barnard (1967), 
Hartigan (1967), Birnbaum (1968, 1978), Durbin (1970), Basu (1975), Dawid 
(1983a), Joshi (1983), Berger (1985b), Hill (1987) and Bayarri et al. (1988). 


Example 5.1 illustrates the way in which ancillary statistics often arise naturally 
as a consequence of the way in which data are collected. In general, it is 
very often the case that the sample size, $n$, is fixed in advance and that inferences 
are automatically made conditional on $n$, without further reflection. It is, however, 
perhaps not obvious that inferences can be made conditional on $n$ if the latter has 
arisen as a result of such familiar imperatives as “stop collecting data when you feel 
tired”, or “when the research budget runs out”. The kind of analysis given above 
makes it intuitively clear that such conditioning is, in fact, valid, provided that the 
mechanism which has led to $n$ “does not depend on $\theta$”. This latter condition may, 
however, not always be immediately transparent, and the following definition 
provides one version of a more formal framework for considering sampling 
mechanisms and their dependence on model parameters. 


Definition 5.2. (Stopping rule). A stopping rule, $h$, for (sequential) sampling 
from a sequence of observables $x_1 \in X_1, x_2 \in X_2, \ldots$, is a sequence of 
functions $h_n : X_1 \times \cdots \times X_n \to [0, 1]$, such that, if $\boldsymbol{x}_{(n)} = (x_1, \ldots, x_n)$ is 
observed, then sampling is terminated with probability $h_n(\boldsymbol{x}_{(n)})$; otherwise, 
the $(n + 1)$th observation is made. A stopping rule is proper if the induced 
probability distribution $p_h(n)$, $n = 1, 2, \ldots$, for final sample size guarantees 
that the latter is finite. The rule is deterministic if $h_n(\boldsymbol{x}_{(n)}) \in \{0, 1\}$ for all 
$(n, \boldsymbol{x}_{(n)})$; otherwise, it is a randomised stopping rule. 


In general, we must regard the data resulting from a sampling mechanism
defined by a stopping rule $h$ as consisting of $(n, x_{(n)})$, the sample size, together
with the observed quantities $x_1, \ldots, x_n$. A parametric model for these data thus
involves a probability density of the form $p(n, x_{(n)} \mid h, \theta)$, conditioning both on
the stopping rule (i.e., sampling mechanism) and on an underlying labelling
parameter $\theta$. But, either through unawareness or misapprehension, this is typically
ignored and, instead, we act as if the actual observed sample size $n$ had been fixed
in advance, in effect assuming that

\[
p(n, x_{(n)} \mid h, \theta) = p(x_{(n)} \mid n, \theta) = p(x_{(n)} \mid \theta),
\]

using the standard notation we have hitherto adopted for fixed $n$. The important
question that now arises is the following: under what circumstances, if any, can
we proceed to make inferences about $\theta$ on the basis of this (generally erroneous!)
assumption, without considering explicit conditioning on the actual form of $h$? Let
us first consider a simple example.


Example 5.2. (“Biased” stopping rule for a Bernoulli sequence). Suppose, given $\theta$,
that $x_1, x_2, \ldots$ may be regarded as a sequence of independent Bernoulli random quantities
with $p(x_i \mid \theta) = \mathrm{Bi}(x_i \mid \theta, 1)$, $x_i = 0, 1$, and that a sequential sample is to be
obtained using the deterministic stopping rule $h$, defined by: $h_1(1) = 1$, $h_1(0) = 0$,
$h_2(x_1, x_2) = 1$ for all $x_1, x_2$. In other words, if there is a success on the first trial,
sampling is terminated (resulting in $n = 1$, $x_1 = 1$); otherwise, two observations are
obtained (resulting in either $n = 2$, $x_1 = 0$, $x_2 = 0$ or $n = 2$, $x_1 = 0$, $x_2 = 1$).

At first sight, it might appear essential to take explicit account of $h$ in making inferences
about $\theta$, since the sampling procedure seems designed to bias us towards believing in large
values of $\theta$. Consider, however, the following detailed analysis:


\[
\begin{aligned}
p(n = 1, x_1 = 1 \mid h, \theta) &= p(x_1 = 1 \mid n = 1, h, \theta)\, p(n = 1 \mid h, \theta) \\
&= 1 \cdot p(x_1 = 1 \mid \theta) = p(x_1 = 1 \mid \theta)
\end{aligned}
\]

and, for $x = 0, 1$,

\[
\begin{aligned}
p(n = 2, x_1 = 0, x_2 = x \mid h, \theta)
&= p(x_1 = 0, x_2 = x \mid n = 2, h, \theta)\, p(n = 2 \mid h, \theta) \\
&= p(x_1 = 0 \mid n = 2, h, \theta)\, p(x_2 = x \mid x_1 = 0, n = 2, h, \theta)\, p(n = 2 \mid h, \theta) \\
&= 1 \cdot p(x_2 = x \mid x_1 = 0, \theta)\, p(x_1 = 0 \mid \theta) \\
&= p(x_2 = x, x_1 = 0 \mid \theta).
\end{aligned}
\]

Thus, for all $(n, x_{(n)})$ having non-zero probability, we obtain in this case

\[
p(n, x_{(n)} \mid h, \theta) = p(x_{(n)} \mid \theta),
\]

the latter considered pointwise as functions of $\theta$ (i.e., likelihoods). It then follows trivially
from Bayes’ theorem that, for any specified $p(\theta)$, inferences for $\theta$ based on assuming $n$ to
have been fixed at its observed value will be identical to those based on a likelihood derived
from explicit consideration of $h$.
Consider now a randomised version of this stopping rule which is defined by $h_1(1) = \pi$,
$h_1(0) = 0$, $h_2(x_1, x_2) = 1$ for all $x_1, x_2$. In this case, we have

\[
\begin{aligned}
p(n = 1, x_1 = 1 \mid h, \theta) &= p(x_1 = 1 \mid n = 1, h, \theta)\, p(n = 1 \mid h, \theta) \\
&= 1 \cdot \pi \cdot p(x_1 = 1 \mid \theta),
\end{aligned}
\]

with, for $x = 0, 1$,

\[
\begin{aligned}
p(n = 2, x_1 = 0, x_2 = x \mid h, \theta)
&= p(n = 2 \mid x_1 = 0, h, \theta)\, p(x_1 = 0 \mid h, \theta)\, p(x_2 = x \mid x_1 = 0, n = 2, h, \theta) \\
&= 1 \cdot p(x_1 = 0 \mid \theta)\, p(x_2 = x \mid \theta)
\end{aligned}
\]

and

\[
\begin{aligned}
p(n = 2, x_1 = 1, x_2 = x \mid h, \theta)
&= p(n = 2 \mid x_1 = 1, h, \theta)\, p(x_1 = 1 \mid h, \theta)\, p(x_2 = x \mid x_1 = 1, n = 2, h, \theta) \\
&= (1 - \pi)\, p(x_1 = 1 \mid \theta)\, p(x_2 = x \mid \theta).
\end{aligned}
\]

Thus, for all $(n, x_{(n)})$ having non-zero probability, we again find that

\[
p(n, x_{(n)} \mid h, \theta) \propto p(x_{(n)} \mid \theta)
\]

as functions of $\theta$, so that the proportionality of the likelihoods once more implies identical
inferences from Bayes’ theorem, for any given $p(\theta)$.
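The proportionality claimed in Example 5.2 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names (`bernoulli_lik`, `stop_prob`, `joint_lik`, `ratios`, `total_probability`) are illustrative, and the joint likelihood is assembled as "probability of not stopping at each earlier stage, times probability of stopping now, times the fixed-sample-size likelihood".

```python
from itertools import product


def bernoulli_lik(xs, theta):
    """Fixed-sample-size likelihood p(x_(n) | theta) of a 0/1 sequence."""
    out = 1.0
    for x in xs:
        out *= theta if x == 1 else 1.0 - theta
    return out


def stop_prob(xs, pi):
    """h_n(x_(n)) for the randomised rule: h_1(1) = pi, h_1(0) = 0, h_2 = 1."""
    if len(xs) == 1:
        return pi if xs[0] == 1 else 0.0
    return 1.0


def joint_lik(xs, theta, pi):
    """p(n, x_(n) | h, theta): continue at stages 1..n-1, stop at stage n."""
    weight = stop_prob(xs, pi)
    for i in range(1, len(xs)):
        weight *= 1.0 - stop_prob(xs[:i], pi)
    return weight * bernoulli_lik(xs, theta)


def ratios(pi, thetas=(0.2, 0.5, 0.8)):
    """Ratio joint / fixed-n likelihood for every outcome, at several theta."""
    result = {}
    for n in (1, 2):
        for xs in product((0, 1), repeat=n):
            result[xs] = [joint_lik(xs, t, pi) / bernoulli_lik(xs, t) for t in thetas]
    return result


def total_probability(pi, theta):
    """Sum of p(n, x_(n) | h, theta) over all outcomes; should equal 1."""
    return sum(joint_lik(xs, theta, pi)
               for n in (1, 2) for xs in product((0, 1), repeat=n))


if __name__ == "__main__":
    for outcome, values in ratios(pi=0.3).items():
        print(outcome, values)
```

For each outcome the printed ratios do not vary with $\theta$, which is exactly the statement that the stopping rule contributes only a constant factor to the likelihood. Setting `pi=1.0` recovers the deterministic rule of the first part of the example.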


The analysis of the preceding example showed, perhaps contrary to intuition,
that, although seemingly biasing the analysis towards beliefs in larger values of
$\theta$, the stopping rule does not in fact lead to a different likelihood from that of the
a priori fixed sample size. The following, rather trivial, proposition makes clear
that this is true for all stopping rules as defined in Definition 5.2, which we might
therefore describe as “likelihood non-informative stopping rules”.


Proposition 5.1. (Stopping rules are likelihood non-informative).
For any stopping rule $h$, for (sequential) sampling from a sequence of observables
$x_1, x_2, \ldots$, having fixed sample size parametric model
$p(x_{(n)} \mid n, \theta) = p(x_{(n)} \mid \theta)$,

\[
p(n, x_{(n)} \mid h, \theta) \propto p(x_{(n)} \mid \theta), \qquad \theta \in \Theta,
\]

for all $(n, x_{(n)})$ such that $p(n, x_{(n)} \mid h, \theta) \neq 0$.


Proof. This follows straightforwardly on noting that

\[
p(n, x_{(n)} \mid h, \theta)
= \Big[ h_n(x_{(n)}) \prod_{i=1}^{n-1} \big( 1 - h_i(x_{(i)}) \big) \Big]\, p(x_{(n)} \mid \theta),
\]

and that the term in square brackets does not depend on $\theta$.

Again, it is a trivial consequence of Bayes’ theorem that, for any specified
prior density, prior to posterior inference for $\theta$ given data $(n, x_{(n)})$ obtained using
a likelihood non-informative stopping rule $h$ can proceed by acting as if $x_{(n)}$ were
obtained using a fixed sample size $n$. However, a notationally precise rendering of
Bayes’ theorem,

\[
\begin{aligned}
p(\theta \mid n, x_{(n)}, h) &\propto p(n, x_{(n)} \mid h, \theta)\, p(\theta \mid h) \\
&\propto p(x_{(n)} \mid \theta)\, p(\theta \mid h),
\end{aligned}
\]

reveals that knowledge of $h$ might well affect the specification of the prior density! It
is for this reason that we use the term “likelihood non-informative” rather than just
“non-informative” stopping rules. It cannot be emphasised too often that, although
it is often convenient for expository reasons to focus at a given juncture on one or
other of the “likelihood” and “prior” components of the model, our discussion in
Chapter 4 makes clear their basic inseparability in coherent modelling and analysis
of beliefs. This issue is highlighted in the following example.


Example 5.3. (“Biased” stopping rule for a normal mean). Suppose, given $\theta$, that
$x_1, x_2, \ldots$ may be regarded as a sequence of independent normal random quantities with
$p(x_i \mid \theta) = N(x_i \mid \theta, 1)$, $x_i \in \Re$. Suppose further that an investigator has a particular
concern with the parameter value $\theta = 0$ and wants to stop sampling if
$\bar{x}_n = \sum_i x_i / n$ ever takes on a value that is “unlikely”, assuming $\theta = 0$ to be true.

For any fixed sample size $n$, if “unlikely” is interpreted as “an event having probability
less than or equal to $\alpha$”, for small $\alpha$, a possible stopping rule, using the fact that
$p(\bar{x}_n \mid n, \theta) = N(\bar{x}_n \mid \theta, n)$, might be

\[
h_n(x_{(n)}) =
\begin{cases}
1, & \text{if } |\bar{x}_n| > k(\alpha)/\sqrt{n} \\
0, & \text{if } |\bar{x}_n| \le k(\alpha)/\sqrt{n}
\end{cases}
\]

for suitable $k(\alpha)$ (for example, $k = 1.96$ for $\alpha = 0.05$, $k = 2.57$ for $\alpha = 0.01$, or $k = 3.31$
for $\alpha = 0.001$). It can be shown, using the law of the iterated logarithm (see, for example,
Section 3.2.3), that this is a proper stopping rule, so that termination will certainly occur for
some finite $n$, yielding data $(n, x_{(n)})$. Moreover, defining

\[
S_n = \Big\{ x_{(n)} ;\ |\bar{x}_1| \le k(\alpha),\ |\bar{x}_2| \le \frac{k(\alpha)}{\sqrt{2}},\ \ldots,\
|\bar{x}_{n-1}| \le \frac{k(\alpha)}{\sqrt{n-1}},\ |\bar{x}_n| > \frac{k(\alpha)}{\sqrt{n}} \Big\},
\]

we have

\[
\begin{aligned}
p(n, x_{(n)} \mid h, \theta) &= p(x_{(n)} \mid n, h, \theta)\, p(n \mid h, \theta) \\
&= p(x_{(n)} \mid S_n, \theta)\, p(S_n \mid \theta) \\
&= p(x_{(n)} \mid \theta),
\end{aligned}
\]

as a function of $\theta$, for all $(n, x_{(n)})$ for which the left-hand side is non-zero. It follows that $h$
is a likelihood non-informative stopping rule.

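Now consider prior to posterior inference for $\theta$, where, for illustration, we assume the
prior specification $p(\theta) = N(\theta \mid \mu, \lambda)$, with precision $\lambda \approx 0$, to be interpreted as indicating
extremely vague prior beliefs about $\theta$, which take no explicit account of the stopping rule
$h$. Since the latter is likelihood non-informative, we have

\[
\begin{aligned}
p(\theta \mid x_{(n)}, n) &\propto p(x_{(n)} \mid n, \theta)\, p(\theta) \\
&\propto p(\bar{x}_n \mid n, \theta)\, p(\theta) \\
&\propto N(\bar{x}_n \mid \theta, n)\, N(\theta \mid \mu, \lambda)
\end{aligned}
\]

by virtue of the sufficiency of $(n, \bar{x}_n)$ for the normal parametric model. The right-hand side
is easily seen to be proportional to $\exp\{-\tfrac{1}{2} Q(\theta)\}$, where

\[
Q(\theta) = (n + \lambda) \Big[ \theta - \frac{n \bar{x}_n + \lambda \mu}{n + \lambda} \Big]^2,
\]

which implies that

\[
\begin{aligned}
p(\theta \mid x_{(n)}, n) &= N\Big( \theta \,\Big|\, \frac{n \bar{x}_n + \lambda \mu}{n + \lambda},\ (n + \lambda) \Big) \\
&= N(\theta \mid \bar{x}_n, n)
\end{aligned}
\]

for $\lambda \approx 0$.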
One consequence of this vague prior specification is that, having observed $(n, x_{(n)})$,
we are led to the posterior probability statement

\[
P\Big[ \theta \in \Big( \bar{x}_n \pm \frac{k(\alpha)}{\sqrt{n}} \Big) \,\Big|\, n, \bar{x}_n \Big] = 1 - \alpha.
\]

But the stopping rule $h$ ensures that $|\bar{x}_n| > k(\alpha)/\sqrt{n}$. This means that the value $\theta = 0$
certainly does not lie in the posterior interval to which someone with initially very vague
beliefs would attach a high probability. An investigator knowing $\theta = 0$ to be the true value
can therefore, by using this stopping rule, mislead someone who, unaware of the stopping
rule, acts as if initially very vague.

However, let us now consider an analysis which takes into account the stopping rule.
The nature of $h$ might suggest a prior specification $p(\theta \mid h)$ that recognises $\theta = 0$ as a
possibly “special” parameter value, which should be assigned non-zero prior probability
(rather than the zero probability resulting from any continuous prior density specification).
As an illustration, suppose that we specify

\[
p(\theta \mid h) = \pi\, 1_{(\theta = 0)}(\theta) + (1 - \pi)\, 1_{(\theta \neq 0)}(\theta)\, N(\theta \mid 0, \lambda_0),
\]

which assigns a “spike” of probability, $\pi$, to the special value, $\theta = 0$, and assigns $1 - \pi$
times a $N(\theta \mid 0, \lambda_0)$ density to the range $\theta \neq 0$.

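Since $h$ is a likelihood non-informative stopping rule and $(n, \bar{x}_n)$ are sufficient statistics
for the normal parametric model, we have

\[
p(\theta \mid n, x_{(n)}, h) \propto N(\bar{x}_n \mid \theta, n)\, p(\theta \mid h).
\]

The complete posterior $p(\theta \mid n, x_{(n)}, h)$ is thus given by

\[
\begin{aligned}
&\frac{\pi\, 1_{(\theta = 0)}(\theta)\, N(\bar{x}_n \mid 0, n)
 + (1 - \pi)\, 1_{(\theta \neq 0)}(\theta)\, N(\bar{x}_n \mid \theta, n)\, N(\theta \mid 0, \lambda_0)}
 {\pi\, N(\bar{x}_n \mid 0, n) + (1 - \pi) \int_{-\infty}^{\infty} N(\bar{x}_n \mid \theta, n)\, N(\theta \mid 0, \lambda_0)\, d\theta} \\
&\qquad = \pi^{*}\, 1_{(\theta = 0)}(\theta)
 + (1 - \pi^{*})\, 1_{(\theta \neq 0)}(\theta)\, N\Big( \theta \,\Big|\, \frac{n \bar{x}_n}{n + \lambda_0},\ n + \lambda_0 \Big),
\end{aligned}
\]

where, since

\[
\int_{-\infty}^{\infty} N(\bar{x}_n \mid \theta, n)\, N(\theta \mid 0, \lambda_0)\, d\theta
= N\Big( \bar{x}_n \,\Big|\, 0,\ \frac{n \lambda_0}{n + \lambda_0} \Big),
\]

it is easily verified that

\[
\begin{aligned}
\pi^{*} &= \Big\{ 1 + \frac{1 - \pi}{\pi}\,
 \frac{N(\bar{x}_n \mid 0,\ n \lambda_0 (n + \lambda_0)^{-1})}{N(\bar{x}_n \mid 0, n)} \Big\}^{-1} \\
&= \Big\{ 1 + \frac{1 - \pi}{\pi} \Big( 1 + \frac{n}{\lambda_0} \Big)^{-1/2}
 \exp\Big[ \tfrac{1}{2} (\sqrt{n}\, \bar{x}_n)^2 \Big( 1 + \frac{\lambda_0}{n} \Big)^{-1} \Big] \Big\}^{-1}.
\end{aligned}
\]

The posterior distribution thus assigns a “spike” $\pi^{*}$ to $\theta = 0$ and assigns $1 - \pi^{*}$ times a
$N(\theta \mid (n + \lambda_0)^{-1} n \bar{x}_n,\ n + \lambda_0)$ density to the range $\theta \neq 0$.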

The behaviour of this posterior density, derived from a prior taking account of $h$, is
clearly very different from that of the posterior density based on a vague prior taking no
account of the stopping rule. For qualitative insight, consider the case where actually $\theta = 0$
and $\alpha$ has been chosen to be very small, so that $k(\alpha)$ is quite large. In such a case, $n$ is likely
to be very large and at the stopping point we shall have $\bar{x}_n \approx k(\alpha)/\sqrt{n}$. This means that

\[
\pi^{*} \approx \Big\{ 1 + \frac{1 - \pi}{\pi} \Big( 1 + \frac{n}{\lambda_0} \Big)^{-1/2}
 \exp\big[ \tfrac{1}{2} k^2(\alpha) \big] \Big\}^{-1} \approx 1,
\]

for large $n$, so that knowing the stopping rule and then observing that it results in a large
sample size leads to an increasing conviction that $\theta = 0$. On the other hand, if $\theta$ is appreciably
different from 0, the resulting $n$, and hence $\pi^{*}$, will tend to be small and the posterior will
be dominated by the $N(\theta \mid (n + \lambda_0)^{-1} n \bar{x}_n,\ n + \lambda_0)$ component.
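The qualitative behaviour of the posterior spike $\pi^{*}$ can be illustrated numerically. The sketch below simply evaluates the closed-form expression for $\pi^{*}$ given in the example, with the sample mean placed exactly on the stopping boundary $k(\alpha)/\sqrt{n}$. The function and argument names (`posterior_spike`, `prior_spike`, `lam0`) and the default values $\pi = 0.5$, $\lambda_0 = 1$ are illustrative assumptions, not choices made in the text.

```python
import math


def posterior_spike(xbar, n, prior_spike=0.5, lam0=1.0):
    """Posterior probability pi* of theta = 0 under the spike-and-normal prior.

    All normal distributions are parametrised by precision, as in the text:
    xbar | theta has precision n, and theta | theta != 0 has precision lam0.
    """
    # log of (1 + n/lam0)^(-1/2) * exp{ (1/2) n xbar^2 / (1 + lam0/n) }
    log_ratio = (-0.5 * math.log1p(n / lam0)
                 + 0.5 * n * xbar ** 2 / (1.0 + lam0 / n))
    odds_against = (1.0 - prior_spike) / prior_spike * math.exp(log_ratio)
    return 1.0 / (1.0 + odds_against)


if __name__ == "__main__":
    k = 3.29  # roughly the two-sided normal cut-off for alpha = 0.001
    for n in (10**2, 10**4, 10**6, 10**10):
        xbar = k / math.sqrt(n)          # sample mean on the stopping boundary
        print(n, posterior_spike(xbar, n))
```

With the sample mean held on the boundary, the exponential factor stays bounded while $(1 + n/\lambda_0)^{-1/2}$ shrinks, so the computed $\pi^{*}$ increases towards 1 as the terminal sample size grows, in line with the argument above.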


5.1.5 Decisions and Inference Summaries 


In Chapter 2, we made clear that our central concern is the representation and 
revision of beliefs as the basis for decisions. Either beliefs are to be used directly in 
the choice of an action, or are to be recorded or reported in some selected form, with 
the possibility or intention of subsequently guiding the choice of a future action. 

With slightly revised notation and terminology, we recall from Chapters 2 and 3
the elements and procedures required for coherent, quantitative decision-making.
The elements of a decision problem in the inference context are:


(i) $a \in \mathcal{A}$, available “answers” to the inference problem;
(ii) $\omega \in \Omega$, unknown states of the world;


(iii) $u : \mathcal{A} \times \Omega \to \Re$, a function attaching utilities to each consequence $(a, \omega)$
of a decision to summarise inference in the form of an “answer”, $a$, and an
ensuing state of the world, $\omega$;




(iv) $p(\omega)$, a specification, in the form of a probability distribution, of current beliefs
about the possible states of the world.


The optimal choice of answer to an inference problem is an $a \in \mathcal{A}$ which maximises
the expected utility,

\[
\int_{\Omega} u(a, \omega)\, p(\omega)\, d\omega.
\]

Alternatively, if instead of working with $u(a, \omega)$ we work with a so-called loss
function,

\[
l(a, \omega) = f(\omega) - u(a, \omega),
\]

where $f$ is an arbitrary, fixed function, the optimal choice of answer is an $a \in \mathcal{A}$
which minimises the expected loss,

\[
\int_{\Omega} l(a, \omega)\, p(\omega)\, d\omega.
\]


It is clear from the forms of the expected utilities or losses which have to be
calculated in order to choose an optimal answer, that, if beliefs about unknown
states of the world are to provide an appropriate basis for future decision making,
where, as yet, $\mathcal{A}$ and $u$ (or $l$) may be unspecified, we need to report the complete
belief distribution $p(\omega)$.

However, if an immediate application to a particular decision problem, with
specified $\mathcal{A}$ and $u$ (or $l$), is all that is required, the optimal answer (maximising
the expected utility or minimising the expected loss) may turn out to involve only
limited, specific features of the belief distribution, so that these “summaries” of the
full distribution suffice for decision-making purposes.

In the following headed subsections, we shall illustrate and discuss some of
these commonly used forms of summary. Throughout, we shall have in mind the
context of parametric and predictive inference, where the unknown states of the
world are parameters or future data values (observables), and current beliefs, $p(\omega)$,
typically reduce to one or other of the familiar forms:


$p(\theta)$           initial beliefs about a parameter vector, $\theta$;
$p(\theta \mid x)$    beliefs about $\theta$, given data $x$;
$p(\psi \mid x)$      beliefs about $\psi = g(\theta)$, given data $x$;
$p(y \mid x)$         beliefs about future data $y$, given data $x$.

Point Estimates

In cases where $\omega \in \Omega$ corresponds to an unknown quantity, so that $\Omega$ is $\Re$, or $\Re^k$,
or $\Re^{+}$, or $\Re \times \Re^{+}$, etc., and the required answer, $a \in \mathcal{A}$, is an estimate of the
true value of $\omega$ (so that $\mathcal{A} = \Omega$), the corresponding decision problem is typically
referred to as one of point estimation.

If $\omega = \theta$ or $\omega = \psi$, we refer to parametric point estimation; if $\omega = y$, we
refer to predictive point estimation. Moreover, since one is almost certain not to
get the answer exactly right in an estimation problem, statisticians typically work
directly with the loss function concept, rather than with the utility function. A
point estimation problem is thus completely defined once $\mathcal{A} = \Omega$ and $l(a, \omega)$ are
specified. Direct intuition suggests that in the one-dimensional case, distributional
summaries such as the mean, median or mode of $p(\omega)$ could be reasonable point
estimates of a random quantity $\omega$. Clearly, however, these could differ considerably,
and more formal guidance may be required as to when and why particular
functionals of the belief distribution are justified as point estimates. This is provided
by the following definition and result.


Definition 5.3. (Bayes estimate). A Bayes estimate of $\omega$ with respect to the
loss function $l(a, \omega)$ and the belief distribution $p(\omega)$ is an $a \in \mathcal{A} = \Omega$ which
minimises $\int_{\Omega} l(a, \omega)\, p(\omega)\, d\omega$.


Proposition 5.2. (Forms of Bayes estimates).
(i) If $\mathcal{A} = \Omega = \Re^k$, $l(a, \omega) = (a - \omega)^t H (a - \omega)$, and $H$ is symmetric
definite positive, the Bayes estimate satisfies
\[
H a = H E(\omega).
\]
If $H^{-1}$ exists, $a = E(\omega)$, and so the Bayes estimate with respect to
quadratic form loss is the mean of $p(\omega)$, assuming the mean to exist.
(ii) If $\mathcal{A} = \Omega = \Re$ and $l(a, \omega) = c_1 (a - \omega) 1_{(\omega \le a)}(\omega) + c_2 (\omega - a) 1_{(\omega > a)}(\omega)$,
the Bayes estimate with respect to linear loss is the quantile such that
\[
P(\omega \le a) = c_2 / (c_1 + c_2).
\]
If $c_1 = c_2$, the right-hand side equals $1/2$ and so the Bayes estimate with
respect to absolute value loss is a median of $p(\omega)$.
(iii) If $\mathcal{A} = \Omega \subseteq \Re^k$ and $l(a, \omega) = 1 - 1_{B_\varepsilon(a)}(\omega)$, where $B_\varepsilon(a)$ is a ball
of radius $\varepsilon$ in $\Omega$ centred at $a$, the Bayes estimate maximises
\[
\int_{B_\varepsilon(a)} p(\omega)\, d\omega.
\]
As $\varepsilon \to 0$, the function to be maximised tends to $p(a)$ and so the Bayes
estimate with respect to zero-one loss is a mode of $p(\omega)$, assuming a
mode to exist.
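Proof. Differentiating $\int (a - \omega)^t H (a - \omega)\, p(\omega)\, d\omega$ with respect to $a$ and
equating to zero yields

\[
2H \int (a - \omega)\, p(\omega)\, d\omega = 0.
\]

This establishes (i). Since

\[
\int l(a, \omega)\, p(\omega)\, d\omega
= c_1 \int_{\{\omega \le a\}} (a - \omega)\, p(\omega)\, d\omega
+ c_2 \int_{\{\omega > a\}} (\omega - a)\, p(\omega)\, d\omega,
\]

differentiating with respect to $a$ and equating to zero yields

\[
c_1 \int_{\{\omega \le a\}} p(\omega)\, d\omega = c_2 \int_{\{\omega > a\}} p(\omega)\, d\omega,
\]

whence, adding $c_2 \int_{\{\omega \le a\}} p(\omega)\, d\omega$ to each side, we obtain (ii). Finally, since

\[
\int l(a, \omega)\, p(\omega)\, d\omega = 1 - \int 1_{B_\varepsilon(a)}(\omega)\, p(\omega)\, d\omega
\]

and this is minimised when $\int_{B_\varepsilon(a)} p(\omega)\, d\omega$ is maximised, we have (iii).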


Further insight into the nature of case (iii) can be obtained by thinking of a
unimodal, continuous $p(\omega)$ in one dimension. It is then immediate by a continuity
argument that $a$ should be chosen such that

\[
p(a - \varepsilon) = p(a + \varepsilon).
\]


In the case of a unimodal, symmetric belief distribution, $p(\omega)$, for a single
random quantity $\omega$, the mean, median and mode coincide. In general, for unimodal,
positively skewed, densities we have the relation


mean > median > mode


and the difference can be substantial if $p(\omega)$ is markedly skew. Unless, therefore,
there is a very clear need for a point estimate, and a strong rationale for a specific
one of the loss functions considered in Proposition 5.2, the provision of a single
number to summarise $p(\omega)$ may be extremely misleading as a summary of the
information available about $\omega$. Of course, such a comment acquires even greater
force if $p(\omega)$ is multimodal or otherwise “irregular”.
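The three forms of Bayes estimate in Proposition 5.2 can be compared by brute force on a grid. The sketch below is purely illustrative: it assumes a Gamma density with shape 2 and rate 1 as a positively skewed $p(\omega)$ (a choice made here, not in the text), approximates each expected loss by a midpoint sum, and picks the minimising candidate. The names `density`, `expected_loss`, `bayes_estimate` and `three_estimates` are hypothetical helpers.

```python
import math

STEP = 0.01
GRID = [STEP * (j + 0.5) for j in range(3000)]          # midpoints on (0, 30)
CANDIDATES = [0.02 * i for i in range(1, 201)]          # answers 0.02, ..., 4.00


def density(w):
    """Assumed belief density: Gamma(shape=2, rate=1), p(w) = w * exp(-w)."""
    return w * math.exp(-w)


WEIGHTS = [density(w) * STEP for w in GRID]             # approximate cell masses


def expected_loss(a, loss):
    """Midpoint-rule approximation to the integral of l(a, w) p(w) dw."""
    return sum(loss(a, w) * pw for w, pw in zip(GRID, WEIGHTS))


def bayes_estimate(loss):
    """Candidate answer with the smallest approximate expected loss."""
    return min(CANDIDATES, key=lambda a: expected_loss(a, loss))


def quadratic(a, w):
    return (a - w) ** 2


def absolute(a, w):
    return abs(a - w)


def zero_one(a, w, eps=0.05):
    """1 minus the indicator of the ball of radius eps centred at a."""
    return 0.0 if abs(a - w) <= eps else 1.0


def three_estimates():
    """Return (mean-like, median-like, mode-like) estimates."""
    return (bayes_estimate(quadratic),
            bayes_estimate(absolute),
            bayes_estimate(zero_one))
```

Calling `three_estimates()` returns values close to the mean (2), the median (about 1.68) and the mode (1) of this density, reproducing the ordering mean > median > mode for a positively skewed distribution. The grid spacing and the radius `eps` were chosen so that cell midpoints never fall on the boundary of the ball; other choices may need more care.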

For further discussion of Bayes estimators, see, for example, DeGroot and Rao
(1963, 1966), Sacks (1963), Farrell (1964), Brown (1973), Tiao and Box (1974),
Berger and Srinivasan (1978), Berger (1979, 1986), Hwang (1985, 1988), de la
Horra (1987, 1988, 1992), Ghosh (1992a, 1992b), Irony (1992) and Spall and
Maryak (1992).
Credible regions


We have emphasised that, from a theoretical perspective, uncertainty about an
unknown quantity of interest, $\omega$, needs to be communicated in the form of the full
(prior, posterior or predictive) density, $p(\omega)$, if formal calculation of expected loss
or utility is to be possible for any arbitrary future decision problem. In practice,
however, $p(\omega)$ may be a somewhat complicated entity and it may be both more
convenient, and also sufficient for general orientation regarding the uncertainty
about $\omega$, simply to describe regions $C \subseteq \Omega$ of given probability under $p(\omega)$. Thus,
for example, in the case where $\Omega \subseteq \Re$, the identification of intervals containing
50%, 90%, 95% or 99% of the probability under $p(\omega)$ might suffice to give a good
idea of the general quantitative messages implicit in $p(\omega)$. This is the intuitive basis
of popular graphical representations of univariate distributions such as box plots.


Definition 5.4. (Credible Region). A region $C \subseteq \Omega$ such that

\[
\int_{C} p(\omega)\, d\omega = 1 - \alpha
\]

is said to be a $100(1 - \alpha)\%$ credible region for $\omega$, with respect to $p(\omega)$.

If $\Omega \subseteq \Re$, connected credible regions will be referred to as credible
intervals.

If $p(\omega)$ is a (prior-posterior-predictive) density, we refer to
(prior-posterior-predictive) credible regions.


Clearly, for any given $\alpha$ there is not a unique credible region, even if we
restrict attention to connected regions, as we should normally wish to do for obvious
ease of interpretation (at least in cases where $p(\omega)$ is unimodal). For given $\Omega$,
$p(\omega)$ and fixed $\alpha$, the problem of choosing among the subsets $C \subseteq \Omega$ such that
$\int_{C} p(\omega)\, d\omega = 1 - \alpha$ could be viewed as a decision problem, provided that we are
willing to specify a loss function, $l(C, \omega)$, reflecting the possible consequences of
quoting the $100(1 - \alpha)\%$ credible region $C$. We now describe the resulting form of
credible region when a loss function is used which encapsulates the intuitive idea
that, for given $\alpha$, we would prefer to report a credible region $C$ whose size $\|C\|$
(volume, area, length) is minimised.


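Proposition 5.3. (Minimal size credible regions). Let $p(\omega)$ be a probability
density for $\omega \in \Omega$ almost everywhere continuous; given $\alpha$, $0 < \alpha < 1$, if
$\mathcal{A} = \{C;\ P(\omega \in C) = 1 - \alpha\} \neq \emptyset$ and

\[
l(C, \omega) = k \|C\| - 1_{C}(\omega), \qquad C \in \mathcal{A}, \quad \omega \in \Omega, \quad k > 0,
\]

then $C$ is optimal if and only if it has the property that $p(\omega_1) \ge p(\omega_2)$ for all
$\omega_1 \in C$, $\omega_2 \notin C$ (except possibly for a subset of $\Omega$ of zero probability).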
Proof. It follows straightforwardly that, for any $C \in \mathcal{A}$,

\[
\int_{\Omega} l(C, \omega)\, p(\omega)\, d\omega = k \|C\| - (1 - \alpha),
\]

so that an optimal $C$ must have minimal size.

If $C$ has the stated property and $D$ is any other region belonging to $\mathcal{A}$, then since
$C = (C \cap D) \cup (C \cap D^c)$, $D = (C \cap D) \cup (C^c \cap D)$ and $P(\omega \in C) = P(\omega \in D)$,
we have

\[
\inf_{\omega \in C \cap D^c} p(\omega)\, \|C \cap D^c\|
\le \int_{C \cap D^c} p(\omega)\, d\omega
= \int_{C^c \cap D} p(\omega)\, d\omega
\le \sup_{\omega \in C^c \cap D} p(\omega)\, \|C^c \cap D\|
\]

with

\[
\sup_{\omega \in C^c \cap D} p(\omega) \le \inf_{\omega \in C \cap D^c} p(\omega),
\]

so that $\|C \cap D^c\| \le \|C^c \cap D\|$, and hence $\|C\| \le \|D\|$.

If $C$ does not have the stated property, there exists $A \subseteq C$ such that for all
$\omega_1 \in A$, there exists $\omega_2 \notin C$ such that $p(\omega_2) > p(\omega_1)$. Let $B \subseteq C^c$ be such that
$P(\omega \in A) = P(\omega \in B)$ and $p(\omega_2) > p(\omega_1)$ for all $\omega_2 \in B$ and $\omega_1 \in A$. Define
$D = (C \cap A^c) \cup B$. Then $D \in \mathcal{A}$ and by a similar argument to that given above
the result follows by showing that $\|D\| < \|C\|$.


The property of Proposition 5.3 is worth emphasising in the form of a definition 
(Box and Tiao, 1965). 


Definition 5.5. (Highest probability density (HPD) regions).
A region $C \subseteq \Omega$ is said to be a $100(1 - \alpha)\%$ highest probability density region
for $\omega$ with respect to $p(\omega)$ if

(i) $P(\omega \in C) = 1 - \alpha$;

(ii) $p(\omega_1) \ge p(\omega_2)$ for all $\omega_1 \in C$ and $\omega_2 \notin C$, except possibly for a
subset of $\Omega$ having probability zero.

If $p(\omega)$ is a (prior-posterior-predictive) density, we refer to highest
(prior-posterior-predictive) density regions.


Clearly, the credible region approach to summarising $p(\omega)$ is not particularly
useful in the case of discrete $\Omega$, since such regions will only exist for limited choices
of $\alpha$. The above development should therefore be understood as intended for the
case of continuous $\Omega$.

For a number of commonly occurring univariate forms of $p(\omega)$, there exist
tables which facilitate the identification of HPD intervals for a range of values of $\alpha$
(see, for example, Isaacs et al., 1974, Ferrandiz and Sendra, 1982, and Lindley and
Scott, 1985).


[Figure 5.1a: $\omega_0$ almost as “plausible” as all $\omega \in C$]

[Figure 5.1b: $\omega_0$ much less “plausible” than most $\omega \in C$]


In general, however, the derivation of an HPD region requires numerical
calculation and, particularly if $p(\omega)$ does not exhibit markedly skewed behaviour, it
may be satisfactory in practice to quote some more simply calculated credible
region. For example, in the univariate case, conventional statistical tables facilitate
the identification of intervals which exclude equi-probable tails of $p(\omega)$ for many
standard distributions.

Although an appropriately chosen selection of credible regions can serve to
give a useful summary of $p(\omega)$ when we focus just on the quantity $\omega$, there is
a fundamental difficulty which prevents such regions serving, in general, as a
proxy for the actual density $p(\omega)$. The problem is that of lack of invariance under
parameter transformation. Even if $\nu = g(\omega)$ is a one-to-one transformation, it
is easy to see that there is no general relation between HPD regions for $\omega$ and $\nu$.
In addition, there is no way of identifying a marginal HPD region for a (possibly
transformed) subset of components of $\omega$ from knowledge of the joint HPD region.
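Both the numerical derivation of an HPD interval and its lack of invariance can be illustrated with a simple grid calculation. The sketch below is an illustration under stated assumptions, not a method taken from the text: it assumes a Gamma density with shape 2 and rate 1 for $\omega$, takes $\nu = \log \omega$ as the one-to-one transformation, and approximates each HPD interval by ranking grid cells by density and accumulating mass until $1 - \alpha$ is reached. The helper names `hpd_interval`, `p_omega` and `p_nu` are hypothetical.

```python
import math


def hpd_interval(density, lo, hi, alpha=0.05, cells=20000):
    """Approximate HPD interval of a unimodal density supported on (lo, hi).

    Grid cells are ranked by density and added, highest first, until their
    accumulated (renormalised) mass reaches 1 - alpha.
    """
    step = (hi - lo) / cells
    mids = [lo + (i + 0.5) * step for i in range(cells)]
    mass = [density(m) * step for m in mids]
    total = sum(mass)
    order = sorted(range(cells), key=lambda i: mass[i], reverse=True)
    kept, acc = [], 0.0
    for i in order:
        kept.append(i)
        acc += mass[i] / total
        if acc >= 1.0 - alpha:
            break
    return mids[min(kept)] - step / 2, mids[max(kept)] + step / 2


def p_omega(w):
    """Assumed density of omega: Gamma(shape=2, rate=1)."""
    return w * math.exp(-w)


def p_nu(v):
    """Implied density of nu = log(omega): p_omega(e^v) * e^v."""
    return math.exp(2.0 * v - math.exp(v))


if __name__ == "__main__":
    lo_w, hi_w = hpd_interval(p_omega, 0.0, 40.0)
    lo_v, hi_v = hpd_interval(p_nu, -12.0, 4.0)
    print("HPD for omega:              ", (lo_w, hi_w))
    print("HPD for nu, mapped to omega:", (math.exp(lo_v), math.exp(hi_v)))
```

The interval obtained by transforming the HPD interval for $\nu$ back to the $\omega$ scale is still a 95% credible interval for $\omega$, but it is visibly different from the HPD interval computed directly for $\omega$, which is the non-invariance referred to above.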

In cases where an HPD credible region $C$ is pragmatically acceptable as a
crude summary of the density $p(\omega)$, then, particularly for small values of $\alpha$ (for
example, 0.05, 0.01), a specific value $\omega_0 \in \Omega$ will tend to be regarded as somewhat
“implausible” if $\omega_0 \notin C$. This, of course, provides no justification for actions
such as “rejecting the hypothesis that $\omega = \omega_0$”. If we wish to consider such
actions, we must formulate a proper decision problem, specifying alternative actions
and the losses consequent on correct and incorrect actions. Inferences about a
specific hypothesised value $\omega_0$ of a random quantity $\omega$ in the absence of alternative
hypothesised values are often considered in the general statistical literature under
the heading of “significance testing”. We shall discuss this further in Chapter 6.

For the present, it will suffice to note, as illustrated in Figure 5.1, that even
the intuitive notion of “implausibility if $\omega_0 \notin C$” depends much more on the
complete characterisation of $p(\omega)$ than on an either-or assessment based on an
HPD region.

For further discussion of credible regions see, for example, Pratt (1961),
Aitchison (1964, 1966), Wright (1986) and DasGupta (1991).


Hypothesis Testing 
The basic hypothesis testing problem usually considered by statisticians may be
described as a decision problem with elements

\[
\Omega = \{ \omega_0 = [H_0 : \theta \in \Theta_0],\ \omega_1 = [H_1 : \theta \in \Theta_1] \},
\]

together with $p(\omega)$, where $\theta \in \Theta = \Theta_0 \cup \Theta_1$ is the parameter labelling a parametric
model, $p(x \mid \theta)$, $\mathcal{A} = \{a_0, a_1\}$, with $a_1$ ($a_0$) corresponding to rejecting hypothesis
$H_0$ ($H_1$), and loss function $l(a_i, \omega_j) = l_{ij}$, $i, j \in \{0, 1\}$, with the $l_{ij}$ reflecting the
relative seriousness of the four possible consequences and, typically, $l_{00} = l_{11} = 0$.

Clearly, the main motivation and the principal use of the hypothesis testing
framework is in model choice and comparison, an activity which has a somewhat
different flavour from decision-making and inference within the context of an
accepted model. For this reason, we shall postpone a detailed consideration of the
topic until Chapter 6, where we shall provide a much more general perspective on
model choice and criticism.

General discussions of Bayesian hypothesis testing are included in Jeffreys
(1939/1961), Good (1950, 1965, 1983), Lindley (1957, 1961b, 1965, 1977),
Edwards et al. (1963), Pratt (1965), Smith (1965), Farrell (1968), Dickey (1971, 1974,
1977), Lempers (1971), Rubin (1971), Zellner (1971), DeGroot (1973), Leamer
(1978), Box (1980), Shafer (1982b), Gilio and Scozzafava (1985), Smith (1986),
Berger and Delampady (1987), Berger and Sellke (1987) and Hodges (1990, 1992).


5.1.6 Implementation Issues 


Given a likelihood $p(x \mid \theta)$ and prior density $p(\theta)$, the starting point for any form
of parametric inference summary or decision about $\theta$ is the joint posterior density

\[
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta},
\]

and the starting point for any predictive inference summary or decision about future
observables $y$ is the predictive density

\[
p(y \mid x) = \int p(y \mid \theta)\, p(\theta \mid x)\, d\theta.
\]


It is clear that to form these posterior and predictive densities there is a technical
requirement to perform integrations over the range of $\theta$. Moreover, further
summarisation, in order to obtain marginal densities, or marginal moments, or expected
utilities or losses in explicitly defined decision problems, will necessitate further
integrations with respect to components of $\theta$ or $y$, or transformations thereof.

The key problem in implementing the formal Bayes solution to inference
reporting or decision problems is therefore seen to be that of evaluating the required
integrals. In cases where the likelihood just involves a single parameter,
implementation just involves integration in one dimension and is essentially trivial. However,
in problems involving a multiparameter likelihood the task of implementation is
anything but trivial, since, if $\theta$ has $k$ components, two $k$-dimensional integrals are
required just to form $p(\theta \mid x)$ and $p(y \mid x)$. Moreover, in the case of $p(\theta \mid x)$, for
example, $k$ $(k - 1)$-dimensional integrals are required to obtain univariate marginal
density values or moments, $\binom{k}{2}$ $(k - 2)$-dimensional integrals are required to obtain
bivariate marginal densities, and so on. Clearly, if $k$ is at all large, the problem of
implementation will, in general, lead to challenging technical problems, requiring
simultaneous analytic or numerical approximation of a number of multidimensional
integrals.
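To make the one-parameter case concrete, the following sketch forms a posterior density and a predictive probability by direct one-dimensional numerical integration. It is an illustration only: the Bernoulli likelihood, the data (7 successes in 10 trials), the midpoint rule and the hypothetical non-conjugate prior `bump_prior` are all assumptions made here, and the helper names are not taken from the text.

```python
import math


def grid_posterior(likelihood, prior, cells=2000):
    """Posterior density on a midpoint grid over (0, 1).

    The prior need not be normalised: any constant factor cancels when
    dividing by the numerically integrated p(x | theta) p(theta).
    """
    step = 1.0 / cells
    thetas = [(i + 0.5) * step for i in range(cells)]
    unnorm = [likelihood(t) * prior(t) for t in thetas]
    evidence = sum(unnorm) * step            # approximates the normalising integral
    return thetas, [u / evidence for u in unnorm], step


def predictive_success(thetas, post, step):
    """p(y = 1 | x): integral of theta * p(theta | x) for a Bernoulli model."""
    return sum(t * p for t, p in zip(thetas, post)) * step


def bernoulli_likelihood(theta, successes=7, trials=10):
    """Likelihood of the assumed data as a function of theta."""
    return theta ** successes * (1.0 - theta) ** (trials - successes)


def bump_prior(theta):
    """Hypothetical unnormalised prior, peaked at 0.5 and vanishing at 0 and 1."""
    return 1.0 - math.cos(2.0 * math.pi * theta)


if __name__ == "__main__":
    for prior in (lambda t: 1.0, bump_prior):
        thetas, post, step = grid_posterior(bernoulli_likelihood, prior)
        print(predictive_success(thetas, post, step))
```

With the uniform prior the result can be checked against the exact value $8/12$. The same few lines work for any prior on $(0, 1)$, which is the sense in which the single-parameter case is straightforward; the number of grid points needed grows exponentially with the dimension of $\theta$, which is why this direct approach does not extend to the multiparameter problems described above.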

The above discussion has assumed a given specification of a likelihood and
prior density function. However, as we have seen in Chapter 4, although a
specific mathematical form for the likelihood in a given context is very often implied
or suggested by consideration of symmetry, sufficiency or experience, the
mathematical specification of prior densities is typically more problematic. Some of
the problems involved (such as the pragmatic strategies to be adopted in translating
actual beliefs into mathematical form) relate more to practical methodology
than to conceptual and theoretical issues and will not be discussed in detail in
this volume. However, many of the other problems of specifying prior densities
are closely related to the general problems of implementation described above, as
exemplified by the following questions:


(i) given that, for any specific beliefs, there is some arbitrariness in the precise
choice of the mathematical representation of a prior density, are there choices
which enable the integrations required to be carried out straightforwardly
and hence permit the tractable implementation of a range of analyses, thus
facilitating the kind of interpersonal analysis and scientific reporting referred
to in Section 4.8.2 and again later in 6.3.3?

(ii) if the information to be provided by the data is known to be far greater than
that implicit in an individual's prior beliefs, is there any necessity for a precise
mathematical representation of the latter, or can a Bayesian implementation
proceed purely on the basis of this qualitative understanding?

(iii) either in the context of interpersonal analysis, or as a special form of actual
individual analysis, is there a formal way of representing the beliefs of an
individual whose prior information is to be regarded as minimal, relative to
the information provided by the data?

(iv) for general forms of likelihood and prior density, are there analytic/numerical
techniques available for approximating the integrals required for implementing
Bayesian methods?

Question (i) will be answered in Section 5.2, where the concept of a conjugate 
prior density will be introduced. 

Question (ii) will be answered in part at the end of Section 5.2 and in more detail 
in Section 5.3, where an approximate “large sample” Bayesian theory involving 
asymptotic posterior normality will be presented. 

Question (iii) will be answered in Section 5.4, where the information-based
concept of a reference prior density will be introduced. An extended historical
discussion of this celebrated philosophical problem of how to represent “ignorance”
will be given in Section 5.6.2.

Question (iv) will be answered in Section 5.5, where classical applied analysis
techniques such as Laplace’s approximation for integrals will be briefly
reviewed in the context of implementing Bayesian inference and decision summaries,
together with classical numerical analytical techniques such as Gauss-Hermite
quadrature and stochastic simulation techniques such as importance sampling,
sampling-importance-resampling and Markov chain Monte Carlo.
5.2 CONJUGATE ANALYSIS
5.2.1 Conjugate Families 


The first issue raised at the end of Section 5.1.6 is that of tractability. Given a 
likelihood function $p(\mathbf{x} \mid \boldsymbol{\theta})$, for what choices of $p(\boldsymbol{\theta})$ are integrals such as 


$$p(\mathbf{x}) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta} \quad \text{and} \quad p(\mathbf{y} \mid \mathbf{x}) = \int p(\mathbf{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathbf{x})\, d\boldsymbol{\theta}$$


easily evaluated analytically? However, since any particular mathematical form of 
$p(\boldsymbol{\theta})$ is acting as a representation of beliefs, either of an actual individual or as 
part of a stylised sensitivity study involving a range of prior to posterior analyses, 
we require, in addition to tractability, that the class of mathematical functions from 
which $p(\boldsymbol{\theta})$ is to be chosen be both rich in the forms of beliefs it can represent and 
also facilitate the matching of beliefs to particular members of the class. Tractability 
can be achieved by noting that, since Bayes’ theorem may be expressed in the form 


$$p(\boldsymbol{\theta} \mid \mathbf{x}) \propto p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}),$$


both $p(\boldsymbol{\theta} \mid \mathbf{x})$ and $p(\boldsymbol{\theta})$ can be guaranteed to belong to the same general family of 
mathematical functions by choosing $p(\boldsymbol{\theta})$ to have the same “structure” as $p(\mathbf{x} \mid \boldsymbol{\theta})$, 
when the latter is viewed as a function of $\boldsymbol{\theta}$. However, as stated, this is a rather 
vacuous idea, since $p(\boldsymbol{\theta} \mid \mathbf{x})$ and $p(\boldsymbol{\theta})$ would always belong to the same “general 
family” of functions if the latter were suitably defined. To achieve a more meaningful 
version of the underlying idea, let us first recall (from Section 4.5) that if 
$\mathbf{t} = \mathbf{t}(\mathbf{x})$ is a sufficient statistic we have 


$$p(\boldsymbol{\theta} \mid \mathbf{x}) = p(\boldsymbol{\theta} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}),$$


so that we can restate our requirement for tractability in terms of $p(\boldsymbol{\theta})$ having the 
same structure as $p(\mathbf{t} \mid \boldsymbol{\theta})$, when the latter is viewed as a function of $\boldsymbol{\theta}$. Again, 
however, without further constraint on the nature of the sequence of sufficient 
statistics the class of possible functions $p(\boldsymbol{\theta})$ is too large to permit easily interpreted 
matching of beliefs to particular members of the class. This suggests that it is only 
in the case of likelihoods admitting sufficient statistics of fixed dimension that we 
shall be able to identify a family of prior densities which ensures both tractability 
and ease of interpretation. This motivates the following definition. 


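Definition 5.6. (Conjugate prior family). The conjugate family of prior densities 
for $\boldsymbol{\theta} \in \Theta$, with respect to a likelihood $p(\mathbf{x} \mid \boldsymbol{\theta})$ with sufficient statistic 
$\mathbf{t} = \mathbf{t}(\mathbf{x}) = \{n, \mathbf{s}(\mathbf{x})\}$ of a fixed dimension $k$ independent of that of $\mathbf{x}$, is 


$$\{p(\boldsymbol{\theta} \mid \boldsymbol{\tau}),\ \boldsymbol{\tau} = (\tau_0, \tau_1, \ldots, \tau_k) \in \mathcal{T}\},$$

where 

$$\mathcal{T} = \left\{ \boldsymbol{\tau};\ \int_{\Theta} p(\mathbf{s} = (\tau_1, \ldots, \tau_k) \mid \boldsymbol{\theta}, n = \tau_0)\, d\boldsymbol{\theta} < \infty \right\}$$

and 

$$p(\boldsymbol{\theta} \mid \boldsymbol{\tau}) = \frac{p(\mathbf{s} = (\tau_1, \ldots, \tau_k) \mid \boldsymbol{\theta}, n = \tau_0)}{\int_{\Theta} p(\mathbf{s} = (\tau_1, \ldots, \tau_k) \mid \boldsymbol{\theta}, n = \tau_0)\, d\boldsymbol{\theta}}.$$


From Section 4.5 and Definition 5.6, it follows that the likelihoods for which 
conjugate prior families exist are those corresponding to general exponential family 
parametric models (Definitions 4.10 and 4.11), for which, given $f$, $h$, $\phi$ and $c$, 


$$p(\mathbf{x} \mid \boldsymbol{\theta}) = f(\mathbf{x})\, g(\boldsymbol{\theta}) \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) h_i(\mathbf{x}) \right\}, \qquad \mathbf{x} \in X,$$

$$[g(\boldsymbol{\theta})]^{-1} = \int_{X} f(\mathbf{x}) \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) h_i(\mathbf{x}) \right\} d\mathbf{x}.$$


The exponential family model is referred to as regular or non-regular, respectively, 
according as $X$ does not or does depend on $\boldsymbol{\theta}$. 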


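Proposition 5.4. (Conjugate families for regular exponential families). If 
$\mathbf{x} = (x_1, \ldots, x_n)$ is a random sample from a regular exponential family 
distribution such that 


$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{j=1}^{n} f(x_j)\, [g(\boldsymbol{\theta})]^{n} \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) \left( \sum_{j=1}^{n} h_i(x_j) \right) \right\},$$


then the conjugate family for $\boldsymbol{\theta}$ has the form 


$$p(\boldsymbol{\theta} \mid \boldsymbol{\tau}) = [K(\boldsymbol{\tau})]^{-1} [g(\boldsymbol{\theta})]^{\tau_0} \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) \tau_i \right\}, \qquad \boldsymbol{\theta} \in \Theta,$$


where $\boldsymbol{\tau}$ is such that $K(\boldsymbol{\tau}) = \int_{\Theta} [g(\boldsymbol{\theta})]^{\tau_0} \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) \tau_i \right\} d\boldsymbol{\theta} < \infty$. 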


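Proof. By Proposition 4.10 (the Neyman factorisation criterion), the sufficient 
statistics for $\boldsymbol{\theta}$ have the form 

$$\mathbf{t}_n(x_1, \ldots, x_n) = \left\{ n, \sum_{j=1}^{n} h_1(x_j), \ldots, \sum_{j=1}^{n} h_k(x_j) \right\} = \{n, \mathbf{s}(\mathbf{x})\},$$

so that, for any $\boldsymbol{\tau} = (\tau_0, \tau_1, \ldots, \tau_k)$ such that $\int_{\Theta} p(\boldsymbol{\theta} \mid \boldsymbol{\tau})\, d\boldsymbol{\theta} < \infty$, a conjugate 
prior density has the form 


$$\begin{aligned}
p(\boldsymbol{\theta} \mid \boldsymbol{\tau}) &\propto p(s_1(\mathbf{x}) = \tau_1, \ldots, s_k(\mathbf{x}) = \tau_k \mid \boldsymbol{\theta}, n = \tau_0) \\
&\propto [g(\boldsymbol{\theta})]^{\tau_0} \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) \tau_i \right\}
\end{aligned}$$


by Proposition 4.2. 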


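Example 5.4. (Bernoulli likelihood; beta prior). The Bernoulli likelihood has the 
form 


$$\begin{aligned}
p(x_1, \ldots, x_n \mid \theta) &= \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} \qquad (0 \le \theta \le 1) \\
&= (1 - \theta)^{n} \exp\left\{ \log\left( \frac{\theta}{1 - \theta} \right) \sum_{i=1}^{n} x_i \right\},
\end{aligned}$$


so that, by Proposition 5.4, the conjugate prior density for $\theta$ is given by 


$$\begin{aligned}
p(\theta \mid \tau_0, \tau_1) &\propto (1 - \theta)^{\tau_0} \exp\left\{ \log\left( \frac{\theta}{1 - \theta} \right) \tau_1 \right\} \\
&= \frac{1}{K(\tau_0, \tau_1)}\, \theta^{\tau_1} (1 - \theta)^{\tau_0 - \tau_1},
\end{aligned}$$

assuming the existence of 

$$K(\tau_0, \tau_1) = \int_0^1 \theta^{\tau_1} (1 - \theta)^{\tau_0 - \tau_1}\, d\theta.$$


Writing $\alpha = \tau_1 + 1$ and $\beta = \tau_0 - \tau_1 + 1$, we have $p(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$, and hence, 
comparing with the definition of a beta density, 


$$p(\theta \mid \tau_0, \tau_1) = p(\theta \mid \alpha, \beta) = \mathrm{Be}(\theta \mid \alpha, \beta), \qquad \alpha > 0,\ \beta > 0.$$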


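Example 5.5. (Poisson likelihood; gamma prior). The Poisson likelihood has the 
form 


$$\begin{aligned}
p(x_1, \ldots, x_n \mid \theta) &= \prod_{i=1}^{n} \frac{\theta^{x_i} \exp(-\theta)}{x_i!} \qquad (\theta > 0) \\
&= \left( \prod_{i=1}^{n} x_i! \right)^{-1} \exp(-n\theta) \exp\left( \log \theta \sum_{i=1}^{n} x_i \right),
\end{aligned}$$


so that, by Proposition 5.4, the conjugate prior density for $\theta$ is given by 

$$\begin{aligned}
p(\theta \mid \tau_0, \tau_1) &\propto \exp(-\tau_0 \theta) \exp(\tau_1 \log \theta) \\
&= \frac{1}{K(\tau_0, \tau_1)}\, \theta^{\tau_1} \exp(-\tau_0 \theta),
\end{aligned}$$

assuming the existence of 

$$K(\tau_0, \tau_1) = \int_0^{\infty} \theta^{\tau_1} \exp(-\tau_0 \theta)\, d\theta.$$


Writing $\alpha = \tau_1 + 1$ and $\beta = \tau_0$, we have $p(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} \exp(-\beta\theta)$ and hence, comparing 
with the definition of a gamma density, 


$$p(\theta \mid \tau_0, \tau_1) = p(\theta \mid \alpha, \beta) = \mathrm{Ga}(\theta \mid \alpha, \beta), \qquad \alpha > 0,\ \beta > 0.$$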


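Example 5.6. (Normal likelihood; normal-gamma prior). The normal likelihood, 
with unknown mean and precision, has the form 


$$\begin{aligned}
p(x_1, \ldots, x_n \mid \mu, \lambda) &= \prod_{i=1}^{n} \left( \frac{\lambda}{2\pi} \right)^{1/2} \exp\left\{ -\frac{\lambda}{2} (x_i - \mu)^2 \right\} \\
&= (2\pi)^{-n/2} \left[ \lambda^{1/2} \exp\left( -\frac{\lambda}{2} \mu^2 \right) \right]^{n} \exp\left\{ \mu\lambda \sum_{i=1}^{n} x_i - \frac{\lambda}{2} \sum_{i=1}^{n} x_i^2 \right\},
\end{aligned}$$


so that, by Proposition 5.4, the conjugate prior density for $\boldsymbol{\theta} = (\mu, \lambda)$ is given by 


$$\begin{aligned}
p(\mu, \lambda \mid \tau_0, \tau_1, \tau_2) &\propto \left[ \lambda^{1/2} \exp\left( -\frac{\lambda}{2} \mu^2 \right) \right]^{\tau_0} \exp\left\{ \mu\lambda\tau_1 - \frac{\lambda}{2} \tau_2 \right\} \\
&= \frac{1}{K(\tau_0, \tau_1, \tau_2)}\, \lambda^{(\tau_0 - 1)/2} \exp\left\{ -\frac{\lambda}{2} \left( \tau_2 - \frac{\tau_1^2}{\tau_0} \right) \right\} \lambda^{1/2} \exp\left\{ -\frac{\lambda\tau_0}{2} \left( \mu - \frac{\tau_1}{\tau_0} \right)^2 \right\},
\end{aligned}$$


assuming the existence of $K(\tau_0, \tau_1, \tau_2)$, given by 


$$\int_0^{\infty} \int_{-\infty}^{\infty} \lambda^{(\tau_0 - 1)/2} \exp\left\{ -\frac{\lambda}{2} \left( \tau_2 - \frac{\tau_1^2}{\tau_0} \right) \right\} \lambda^{1/2} \exp\left\{ -\frac{\lambda\tau_0}{2} \left( \mu - \frac{\tau_1}{\tau_0} \right)^2 \right\} d\mu\, d\lambda.$$


Writing $\alpha = \frac{1}{2}(\tau_0 + 1)$, $\beta = \frac{1}{2}(\tau_2 - \tau_1^2/\tau_0)$, $\gamma = \tau_1/\tau_0$, and comparing with the definition of a 
normal-gamma density, we have 

$$\begin{aligned}
p(\mu, \lambda \mid \tau_0, \tau_1, \tau_2) &= p(\mu, \lambda \mid \alpha, \beta, \gamma) \\
&= \mathrm{Ng}(\mu, \lambda \mid \alpha, \beta, \gamma) \\
&= \mathrm{N}(\mu \mid \gamma, \lambda(2\alpha - 1))\, \mathrm{Ga}(\lambda \mid \alpha, \beta),
\end{aligned}$$


with $\alpha > \frac{1}{2}$, $\beta > 0$, $\gamma \in \mathbb{R}$. 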
5.2.2 Canonical Conjugate Analysis 


Conjugate prior density families were motivated by considerations of tractability 
in implementing the Bayesian paradigm. The following proposition demonstrates 
that, in the case of regular exponential family likelihoods and conjugate prior densi- 
ties, the analytic forms of the joint posterior and predictive densities which underlie 
any form of inference summary or decision making are easily identified. 


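Proposition 5.5. (Conjugate analysis for regular exponential families). 
For the exponential family likelihood and conjugate prior density of Proposition 5.4: 

(i) the posterior density for $\boldsymbol{\theta}$ is 

$$p(\boldsymbol{\theta} \mid \mathbf{x}, \boldsymbol{\tau}) = p(\boldsymbol{\theta} \mid \boldsymbol{\tau} + \mathbf{t}_n(\mathbf{x})),$$

where 

$$\boldsymbol{\tau} + \mathbf{t}_n(\mathbf{x}) = \left( \tau_0 + n,\ \tau_1 + \sum_{j=1}^{n} h_1(x_j),\ \ldots,\ \tau_k + \sum_{j=1}^{n} h_k(x_j) \right);$$


(ii) the predictive density for future observables $\mathbf{y} = (y_1, \ldots, y_m)$ is 

$$p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\tau}) = p(\mathbf{y} \mid \boldsymbol{\tau} + \mathbf{t}_n(\mathbf{x})) = \prod_{l=1}^{m} f(y_l)\, \frac{K(\boldsymbol{\tau} + \mathbf{t}_n(\mathbf{x}) + \mathbf{t}_m(\mathbf{y}))}{K(\boldsymbol{\tau} + \mathbf{t}_n(\mathbf{x}))},$$


where $\mathbf{t}_m(\mathbf{y}) = \left[ m, \sum_{l=1}^{m} h_1(y_l), \ldots, \sum_{l=1}^{m} h_k(y_l) \right]$. 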
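Proof. By Bayes’ theorem, 

$$\begin{aligned}
p(\boldsymbol{\theta} \mid \mathbf{x}, \boldsymbol{\tau}) &\propto p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\tau}) \\
&\propto [g(\boldsymbol{\theta})]^{\tau_0 + n} \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) \left( \tau_i + \sum_{j=1}^{n} h_i(x_j) \right) \right\} \\
&\propto p(\boldsymbol{\theta} \mid \boldsymbol{\tau} + \mathbf{t}_n(\mathbf{x})),
\end{aligned}$$


which proves (i). Moreover, 


$$\begin{aligned}
p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\tau}) &= \int_{\Theta} p(\mathbf{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathbf{x}, \boldsymbol{\tau})\, d\boldsymbol{\theta} \\
&= \prod_{l=1}^{m} f(y_l)\, [K(\boldsymbol{\tau} + \mathbf{t}_n(\mathbf{x}))]^{-1} \int_{\Theta} [g(\boldsymbol{\theta})]^{\tau_0 + n + m} \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) \left( \tau_i + \sum_{j=1}^{n} h_i(x_j) + \sum_{l=1}^{m} h_i(y_l) \right) \right\} d\boldsymbol{\theta} \\
&= \prod_{l=1}^{m} f(y_l)\, \frac{K(\boldsymbol{\tau} + \mathbf{t}_n(\mathbf{x}) + \mathbf{t}_m(\mathbf{y}))}{K(\boldsymbol{\tau} + \mathbf{t}_n(\mathbf{x}))},
\end{aligned}$$


which proves (ii). 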
Proposition 5.5(i) establishes that the conjugate family is closed under sampling, 
with respect to the corresponding exponential family likelihood, a concept 
which seems to be due to G. A. Barnard. This means that both the joint prior 
and posterior densities belong to the same, simply defined, family of distributions, 
the inference process being totally defined by the mapping $\boldsymbol{\tau} \to (\boldsymbol{\tau} + \mathbf{t}_n(\mathbf{x}))$, 
under which the labelling parameters of the prior density are simply modified by 
the addition of the values of the sufficient statistic to form the labelling parameter 
of the posterior distribution. The inference process defined by Bayes’ theorem is 
therefore reduced from the essentially infinite-dimensional problem of the transformation 
of density functions, to a simple, additive finite-dimensional transformation. 
Proposition 5.5(ii) establishes that a similar, simplifying closure property holds for 
predictive densities. 

The forms arising in the conjugate analysis of a number of standard exponential 
family forms are summarised in Appendix A. However, to provide some preliminary 
insights into the prior → posterior → predictive process described by Proposition 
5.5, we shall illustrate the general results by reconsidering Example 5.4. 


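Example 5.4. (continued). With the Bernoulli likelihood written in its explicit exponential 
family form, and writing $r_n = x_1 + \cdots + x_n$, the posterior density corresponding to 
the conjugate prior density, $p(\theta \mid \tau_0, \tau_1)$, is given by 


$$\begin{aligned}
p(\theta \mid \mathbf{x}, \tau_0, \tau_1) &\propto p(\mathbf{x} \mid \theta)\, p(\theta \mid \tau_0, \tau_1) \\
&\propto (1 - \theta)^{n} \exp\left\{ \log\left( \frac{\theta}{1 - \theta} \right) r_n \right\} (1 - \theta)^{\tau_0} \exp\left\{ \log\left( \frac{\theta}{1 - \theta} \right) \tau_1 \right\} \\
&= \frac{\Gamma(\tau_0(n) + 2)}{\Gamma(\tau_1(n) + 1)\, \Gamma(\tau_0(n) - \tau_1(n) + 1)}\, (1 - \theta)^{\tau_0(n)} \exp\left\{ \log\left( \frac{\theta}{1 - \theta} \right) \tau_1(n) \right\},
\end{aligned}$$


where $\tau_0(n) = \tau_0 + n$, $\tau_1(n) = \tau_1 + r_n$, showing explicitly how the inference process reduces 
to the updating of the prior to posterior hyperparameters by the addition of the sufficient 
statistics, $n$ and $r_n$. 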

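Alternatively, we could proceed on the basis of the original representation of the Bernoulli 
likelihood, combining it directly with the familiar beta prior density, $\mathrm{Be}(\theta \mid \alpha, \beta)$, so 
that 


$$\begin{aligned}
p(\theta \mid \mathbf{x}, \alpha, \beta) &\propto p(\mathbf{x} \mid \theta)\, p(\theta \mid \alpha, \beta) \\
&\propto \theta^{r_n} (1 - \theta)^{n - r_n}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \\
&= \frac{\Gamma(\alpha_n + \beta_n)}{\Gamma(\alpha_n)\, \Gamma(\beta_n)}\, \theta^{\alpha_n - 1} (1 - \theta)^{\beta_n - 1},
\end{aligned}$$


where $\alpha_n = \alpha + r_n$, $\beta_n = \beta + n - r_n$ and, again, the process reduces to the updating of the 
prior to posterior hyperparameters. 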
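
The equivalence of the two updating routes can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the data `xs` and the prior values are hypothetical, and the reparametrisation α = τ₁ + 1, β = τ₀ − τ₁ + 1 is the one given in Example 5.4.

```python
def update_tau(tau0, tau1, xs):
    """Exponential-family form: add the sufficient statistics (n, r_n)."""
    return tau0 + len(xs), tau1 + sum(xs)


def update_beta(alpha, beta, xs):
    """Beta form: alpha_n = alpha + r_n, beta_n = beta + n - r_n."""
    n, r = len(xs), sum(xs)
    return alpha + r, beta + n - r


def tau_to_beta(tau0, tau1):
    """Reparametrise (tau0, tau1) as alpha = tau1 + 1, beta = tau0 - tau1 + 1."""
    return tau1 + 1, tau0 - tau1 + 1


xs = [1, 0, 0, 1, 1, 0, 1, 1]   # hypothetical Bernoulli data: n = 8, r_n = 5
tau0, tau1 = 4.0, 1.0            # hypothetical prior hyperparameters

# Route 1: update (tau0, tau1), then convert to (alpha, beta).
via_tau = tau_to_beta(*update_tau(tau0, tau1, xs))
# Route 2: convert to (alpha, beta) first, then update.
via_beta = update_beta(*tau_to_beta(tau0, tau1), xs)

print(via_tau, via_beta)
```

Both routes lead to the same posterior beta hyperparameters, which is all that the equivalence of the two notational forms amounts to.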




Clearly, the two notational forms and procedures used in the example are 
equivalent. Using the standard exponential family form has the advantage of dis- 
playing the simple hyperparameter updating by the addition of the sufficient statis- 
tics. However, the second form seems much less cumbersome notationally and is 
more transparently interpretable and memorable in terms of the beta density. 

In general, when analysing particular models we shall work in terms of what- 
ever functional representation seems best suited to the task in hand. 


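Example 5.4. (continued). Instead of working with the original Bernoulli likelihood, 
$p(x_1, \ldots, x_n \mid \theta)$, we could, of course, work with a likelihood defined in terms of the sufficient 
statistic $(n, r_n)$. In particular, if either $n$ or $r_n$ were ancillary, we would use one or other of 
$p(r_n \mid n, \theta)$ or $p(n \mid r_n, \theta)$ and, in either case, 


$$p(\theta \mid n, r_n, \alpha, \beta) \propto \theta^{r_n} (1 - \theta)^{n - r_n}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}.$$


Taking the binomial form, $p(r_n \mid n, \theta)$, the prior to posterior operation defined by Bayes’ 
theorem can be simply expressed, in terms of the notation introduced in Section 3.2.4, as 


$$\mathrm{Bi}(r_n \mid \theta, n) \otimes \mathrm{Be}(\theta \mid \alpha, \beta) \equiv \mathrm{Be}(\theta \mid \alpha + r_n, \beta + n - r_n).$$

The predictive density for future Bernoulli observables, which we denote by 

$$\mathbf{y} = (y_1, \ldots, y_m) = (x_{n+1}, \ldots, x_{n+m}),$$

is also easily derived. Writing $r'_m = y_1 + \cdots + y_m$, we see that 

$$\begin{aligned}
p(\mathbf{y} \mid \mathbf{x}, \alpha, \beta) &= p(\mathbf{y} \mid \alpha_n, \beta_n) \\
&= \int_0^1 p(\mathbf{y} \mid \theta)\, p(\theta \mid \alpha_n, \beta_n)\, d\theta \\
&= \frac{\Gamma(\alpha_n + \beta_n)}{\Gamma(\alpha_n)\, \Gamma(\beta_n)} \int_0^1 \theta^{\alpha_n + r'_m - 1} (1 - \theta)^{\beta_n + m - r'_m - 1}\, d\theta \\
&= \frac{\Gamma(\alpha_n + \beta_n)}{\Gamma(\alpha_n)\, \Gamma(\beta_n)}\, \frac{\Gamma(\alpha_{n+m})\, \Gamma(\beta_{n+m})}{\Gamma(\alpha_{n+m} + \beta_{n+m})},
\end{aligned}$$


where 

$$\alpha_{n+m} = \alpha_n + r'_m = \alpha + r_n + r'_m,$$

$$\beta_{n+m} = \beta_n + m - r'_m = \beta + (n + m) - (r_n + r'_m),$$


a result which also could be obtained directly from Proposition 5.5(ii). 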
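If, instead, we were interested in the predictive density for $r'_m$, it easily follows that 


$$\begin{aligned}
p(r'_m \mid \alpha_n, \beta_n, m) &= \int_0^1 p(r'_m \mid m, \theta)\, p(\theta \mid \alpha_n, \beta_n)\, d\theta \\
&= \int_0^1 \binom{m}{r'_m} p(\mathbf{y} \mid \theta)\, p(\theta \mid \alpha_n, \beta_n)\, d\theta \\
&= \binom{m}{r'_m} p(\mathbf{y} \mid \alpha_n, \beta_n).
\end{aligned}$$


Comparison with Section 3.2.2 reveals this predictive density to have the binomial-beta form, 
$\mathrm{Bb}(r'_m \mid \alpha_n, \beta_n, m)$. 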

The particular case $m = 1$ is of some interest, since $p(r'_m = 1 \mid \alpha_n, \beta_n, m = 1)$ is then 
the predictive probability assigned to a success on the $(n + 1)$th trial, given $r_n$ observed 
successes in the first $n$ trials and an initial $\mathrm{Be}(\theta \mid \alpha, \beta)$ belief about the limiting relative 
frequency of successes, $\theta$. 

We see immediately, on substituting into the above, that 


$$p(r'_m = 1 \mid \alpha_n, \beta_n, m = 1) = \frac{\alpha_n}{\alpha_n + \beta_n} = E(\theta \mid \alpha_n, \beta_n),$$


using the fact that $\Gamma(t + 1) = t\,\Gamma(t)$ and recalling, from Section 3.2.2, the form of the mean 
of a beta distribution. 

With respect to quadratic loss, $E(\theta \mid \alpha_n, \beta_n) = (\alpha + r_n)/(\alpha + \beta + n)$ is the optimal 
estimate of $\theta$ given current information, and the above result demonstrates that this should 
serve as the evaluation of the probability of a success on the next trial. In the case $\alpha = \beta = 1$ 
this evaluation becomes $(r_n + 1)/(n + 2)$, which is the celebrated Laplace's rule of succession 
(Laplace, 1812), which has served historically to stimulate considerable philosophical 
debate about the nature of inductive inference. We shall consider this problem further in 
Example 5.16 of Section 5.4.4. For an elementary, but insightful, account of Bayesian 
inference for the Bernoulli case, see Lindley and Phillips (1976). 
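
The binomial-beta predictive probabilities and the rule of succession are easy to evaluate directly. The sketch below is illustrative only: the helper names and the data (n = 10, r_n = 7) are assumptions, and the probability mass function is written from the gamma-function expression derived above, using only the Python standard library.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(r, m, a, b):
    """P(r successes in m future trials) when theta has a Be(a, b) density."""
    return comb(m, r) * exp(log_beta(a + r, b + m - r) - log_beta(a, b))


alpha, beta = 1.0, 1.0        # uniform prior, the case alpha = beta = 1
n, r_n = 10, 7                # hypothetical data
a_n, b_n = alpha + r_n, beta + n - r_n

# Predictive probability of a success on trial n + 1 (the case m = 1).
p_next = beta_binomial_pmf(1, 1, a_n, b_n)
print(p_next, (r_n + 1) / (n + 2))
```

The two printed numbers agree up to floating-point rounding, as the identity Γ(t + 1) = tΓ(t) implies they should.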


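In presenting the basic ideas of conjugate analysis, we used the following 
notation for the $k$-parameter exponential family and corresponding prior form: 


$$p(\mathbf{x} \mid \boldsymbol{\theta}) = f(\mathbf{x})\, g(\boldsymbol{\theta}) \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) h_i(\mathbf{x}) \right\}, \qquad \mathbf{x} \in X,$$

and 

$$p(\boldsymbol{\theta} \mid \boldsymbol{\tau}) = [K(\boldsymbol{\tau})]^{-1} [g(\boldsymbol{\theta})]^{\tau_0} \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) \tau_i \right\}, \qquad \boldsymbol{\theta} \in \Theta,$$


the latter being defined for $\boldsymbol{\tau}$ such that $K(\boldsymbol{\tau}) < \infty$. 

From a notational perspective (cf. Definition 4.12), we can obtain considerable 
simplification by defining $\boldsymbol{\psi} = (\psi_1, \ldots, \psi_k)$, $\mathbf{y} = (y_1, \ldots, y_k)$, where $\psi_i = c_i \phi_i(\boldsymbol{\theta})$ 
and $y_i = h_i(\mathbf{x})$, $i = 1, \ldots, k$, together with prior hyperparameters $n_0$, $\mathbf{y}_0$, 
so that these forms become 


$$p(\mathbf{y} \mid \boldsymbol{\psi}) = a(\mathbf{y}) \exp\{ \mathbf{y}'\boldsymbol{\psi} - b(\boldsymbol{\psi}) \}, \qquad \mathbf{y} \in Y,$$

$$p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0) = c(n_0, \mathbf{y}_0) \exp\{ n_0 \mathbf{y}_0'\boldsymbol{\psi} - n_0 b(\boldsymbol{\psi}) \}, \qquad \boldsymbol{\psi} \in \Psi,$$

for appropriately defined $Y$, $\Psi$ and real-valued functions $a$, $b$ and $c$. We shall refer 
to these (Definition 4.12) as the canonical (or natural) forms of the exponential 
family and its conjugate prior family. If $\Psi = \mathbb{R}^k$, we require $n_0 > 0$, $\mathbf{y}_0 \in Y$ 
in order for $p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0)$ to be a proper density; for $\Psi \neq \mathbb{R}^k$, the situation is 
somewhat more complicated (see Diaconis and Ylvisaker, 1979, for details). We 
shall typically assume that $\Psi$ consists of all $\boldsymbol{\psi}$ such that $\int_Y p(\mathbf{y} \mid \boldsymbol{\psi})\, d\mathbf{y} = 1$ and 
that $b(\boldsymbol{\psi})$ is continuously differentiable and strictly convex throughout the interior 
of $\Psi$. 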

The motivation for choosing $n_0$, $\mathbf{y}_0$ as notation for the prior hyperparameter 
is partly clarified by the following proposition and becomes even clearer in the 
context of Proposition 5.7. 


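Proposition 5.6. (Canonical conjugate analysis). If $\mathbf{y}_1, \ldots, \mathbf{y}_n$ are the values 
of $\mathbf{y}$ resulting from a random sample of size $n$ from the canonical exponential 
family parametric model, $p(\mathbf{y} \mid \boldsymbol{\psi})$, then the posterior density corresponding 
to the canonical conjugate form, $p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0)$, is given by 


$$p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_n) = p\left( \boldsymbol{\psi}\ \Big|\ n_0 + n,\ \frac{n_0 \mathbf{y}_0 + n \bar{\mathbf{y}}_n}{n_0 + n} \right),$$


where $\bar{\mathbf{y}}_n = \sum_{i=1}^{n} \mathbf{y}_i / n$. 

Proof. 

$$\begin{aligned}
p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_n) &\propto \prod_{i=1}^{n} p(\mathbf{y}_i \mid \boldsymbol{\psi})\, p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0) \\
&\propto \exp\{ n \bar{\mathbf{y}}_n'\boldsymbol{\psi} - n b(\boldsymbol{\psi}) \} \exp\{ n_0 \mathbf{y}_0'\boldsymbol{\psi} - n_0 b(\boldsymbol{\psi}) \} \\
&\propto \exp\{ (n_0 \mathbf{y}_0 + n \bar{\mathbf{y}}_n)'\boldsymbol{\psi} - (n_0 + n) b(\boldsymbol{\psi}) \},
\end{aligned}$$

and the result follows. 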
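
The canonical update is a purely arithmetic operation on the pair (n₀, y₀), so it can be sketched in a few lines. The function name `canonical_update` and the numerical values below are hypothetical; the code simply implements the mapping stated in Proposition 5.6 and checks that processing the data in one batch or in two stages gives the same result.

```python
def canonical_update(n0, y0, ys):
    """Map (n0, y0) to (n0 + n, (n0*y0 + n*ybar) / (n0 + n)).

    y0 is a list of length k; ys is a list of n observations, each a list
    of length k holding the values of the canonical statistic y.
    """
    n = len(ys)
    k = len(y0)
    sums = [sum(y[i] for y in ys) for i in range(k)]   # n * ybar, componentwise
    n_post = n0 + n
    y_post = [(n0 * y0[i] + sums[i]) / n_post for i in range(k)]
    return n_post, y_post


# Hypothetical Bernoulli-type data (k = 1, y = x).
ys = [[1], [0], [1], [1]]
n_batch, y_batch = canonical_update(2.0, [0.5], ys)

# The same data processed in two stages.
n_mid, y_mid = canonical_update(2.0, [0.5], ys[:2])
n_seq, y_seq = canonical_update(n_mid, y_mid, ys[2:])

print(n_batch, y_batch, n_seq, y_seq)
```

The agreement of the batch and sequential results reflects the additive nature of the hyperparameter update.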


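Example 5.4. (continued). In the case of the Bernoulli parametric model, we have 
seen earlier that the pairing of the parametric model and conjugate prior can be expressed as 


$$p(x \mid \theta) = (1 - \theta) \exp\left\{ x \log\left( \frac{\theta}{1 - \theta} \right) \right\},$$

$$p(\theta \mid \tau_0, \tau_1) = [K(\boldsymbol{\tau})]^{-1} (1 - \theta)^{\tau_0} \exp\left\{ \tau_1 \log\left( \frac{\theta}{1 - \theta} \right) \right\}.$$


The canonical forms in this case are obtained by setting 

$$y = x, \qquad \psi = \log\left( \frac{\theta}{1 - \theta} \right), \qquad a(y) = 1, \qquad b(\psi) = \log(1 + e^{\psi}),$$

$$c(n_0, y_0) = \frac{\Gamma(n_0 + 2)}{\Gamma(n_0 y_0 + 1)\, \Gamma(n_0 - n_0 y_0 + 1)},$$


and, hence, the posterior distribution of the canonical parameter $\psi$ is given by 


$$p(\psi \mid n_0, y_0, y_1, \ldots, y_n) \propto \exp\left\{ (n + n_0) \left[ \frac{n_0 y_0 + n \bar{y}_n}{n + n_0}\, \psi - b(\psi) \right] \right\}.$$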
Example 5.5. (continued). In the case of the Poisson parametric model, we have 
seen earlier that the pairings of the parametric model and conjugate form can be expressed 
as 


$$p(x \mid \theta) = \frac{1}{x!} \exp(-\theta) \exp(x \log \theta),$$

$$p(\theta \mid \tau_0, \tau_1) = [K(\boldsymbol{\tau})]^{-1} \exp(-\tau_0 \theta) \exp(\tau_1 \log \theta).$$


The canonical forms in this case are obtained by setting 


$$y = x, \qquad \psi = \log \theta, \qquad a(y) = \frac{1}{y!}, \qquad b(\psi) = e^{\psi}, \qquad c(n_0, y_0) = \frac{n_0^{\,n_0 y_0 + 1}}{\Gamma(n_0 y_0 + 1)}.$$


The posterior distribution of the canonical parameter $\psi$ is now immediately given by Proposition 5.6. 


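Example 5.6. (continued). In the case of the normal parametric model, we have seen 
earlier that the pairings of the parametric model and conjugate form can be expressed as 


$$p(x \mid \mu, \lambda) = (2\pi)^{-1/2} \left[ \lambda^{1/2} \exp\left( -\frac{\lambda}{2} \mu^2 \right) \right] \exp\left\{ x(\lambda\mu) - \frac{1}{2} x^2 \lambda \right\},$$

$$p(\mu, \lambda \mid \tau_0, \tau_1, \tau_2) = [K(\boldsymbol{\tau})]^{-1} \left[ \lambda^{1/2} \exp\left( -\frac{\lambda}{2} \mu^2 \right) \right]^{\tau_0} \exp\left\{ \tau_1(\lambda\mu) - \frac{1}{2} \tau_2 \lambda \right\}.$$


The canonical forms in this case are obtained by setting 


$$\mathbf{y} = (y_1, y_2) = (x, x^2), \qquad \boldsymbol{\psi} = (\psi_1, \psi_2) = \left( \lambda\mu, -\tfrac{1}{2}\lambda \right),$$

$$a(\mathbf{y}) = (2\pi)^{-1/2}, \qquad b(\boldsymbol{\psi}) = \log(-2\psi_2)^{-1/2} - \frac{\psi_1^2}{4\psi_2},$$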


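$$c(n_0, \mathbf{y}_0) = \left( \frac{2\pi}{n_0} \right)^{-1/2} \frac{\left[ \frac{1}{2} n_0 (y_{02} - y_{01}^2) \right]^{(n_0 + 1)/2}}{\Gamma\left( \frac{1}{2}(n_0 + 1) \right)}.$$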


Again, the posterior distribution of the canonical parameters $\boldsymbol{\psi} = (\psi_1, \psi_2)$ is now immediately 
given by Proposition 5.6. 
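
For the normal model the same additive update applies to τ = (τ₀, τ₁, τ₂), with sufficient statistics (n, Σxᵢ, Σxᵢ²), and the result can then be translated into the more interpretable (α, β, γ) of Example 5.6. The sketch below is an illustration under those stated definitions; the function names, the prior values and the data are hypothetical.

```python
def update_tau_normal(tau, xs):
    """Add the sufficient statistics (n, sum x, sum x^2) to (tau0, tau1, tau2)."""
    tau0, tau1, tau2 = tau
    return (tau0 + len(xs), tau1 + sum(xs), tau2 + sum(x * x for x in xs))


def tau_to_normal_gamma(tau):
    """alpha = (tau0 + 1)/2, beta = (tau2 - tau1^2/tau0)/2, gamma = tau1/tau0."""
    tau0, tau1, tau2 = tau
    alpha = 0.5 * (tau0 + 1)
    beta = 0.5 * (tau2 - tau1 ** 2 / tau0)
    gamma = tau1 / tau0
    return alpha, beta, gamma


prior_tau = (1.0, 0.0, 1.0)      # hypothetical prior hyperparameters
xs = [1.0, 2.0, 3.0]             # hypothetical observations

post_tau = update_tau_normal(prior_tau, xs)
alpha_n, beta_n, gamma_n = tau_to_normal_gamma(post_tau)
print(post_tau, alpha_n, beta_n, gamma_n)
```

Here γ is the posterior location for μ and λ(2α − 1) = λτ₀ its conditional precision, so τ₀ again behaves like an accumulated sample size.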


For specific applications, the choice of the representation of the parametric 
model and conjugate prior forms is typically guided by the ease of interpretation 
of the parametrisations adopted. Example 5.6 above suffices to demonstrate that 
the canonical forms may be very unappealing. From a theoretical perspective, 
however, the canonical representation often provides valuable unifying insight, as 
in Proposition 5.6, where the economy of notation makes it straightforward to 
demonstrate that the learning process just involves a simple weighted average, 

$$\frac{n_0 \mathbf{y}_0 + n \bar{\mathbf{y}}_n}{n_0 + n},$$


of prior and sample information. Again using the canonical forms, we can give a 
more precise characterisation of this weighted average. 
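Proposition 5.7. (Weighted average form of posterior expectation). 
If $\mathbf{y}_1, \ldots, \mathbf{y}_n$ are the values of $\mathbf{y}$ resulting from a random sample of size $n$ 
from the canonical exponential family parametric model, 


$$p(\mathbf{y} \mid \boldsymbol{\psi}) = a(\mathbf{y}) \exp\{ \mathbf{y}'\boldsymbol{\psi} - b(\boldsymbol{\psi}) \},$$

with canonical conjugate prior $p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0)$, then 

$$E[\nabla b(\boldsymbol{\psi}) \mid n_0, \mathbf{y}_0, \mathbf{y}] = \pi \bar{\mathbf{y}}_n + (1 - \pi) \mathbf{y}_0,$$

where 

$$\pi = \frac{n}{n_0 + n}, \qquad [\nabla b(\boldsymbol{\psi})]_i = \frac{\partial}{\partial \psi_i} b(\boldsymbol{\psi}).$$

Proof. By Proposition 5.6, it suffices to prove that $E(\nabla b(\boldsymbol{\psi}) \mid n_0, \mathbf{y}_0) = \mathbf{y}_0$. 
But 


$$\begin{aligned}
n_0 [\mathbf{y}_0 - E(\nabla b(\boldsymbol{\psi}) \mid n_0, \mathbf{y}_0)] &= n_0 \int_{\Psi} (\mathbf{y}_0 - \nabla b(\boldsymbol{\psi}))\, p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0)\, d\boldsymbol{\psi} \\
&= \int_{\Psi} \nabla p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0)\, d\boldsymbol{\psi},
\end{aligned}$$

and, under the regularity conditions assumed, this last integral is zero. 
This establishes the result. 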


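Proposition 5.7 reveals, in this natural conjugate setting, that the posterior 
expectation of $\nabla b(\boldsymbol{\psi})$, that is its Bayes estimate with respect to quadratic loss 
(see Proposition 5.2), is a weighted average of $\mathbf{y}_0$ and $\bar{\mathbf{y}}_n$. The former is the prior 
estimate of $\nabla b(\boldsymbol{\psi})$; the latter can be viewed as an intuitively “natural” sample-based 
estimate of $\nabla b(\boldsymbol{\psi})$, since 


$$\begin{aligned}
E(\mathbf{y} \mid \boldsymbol{\psi}) - \nabla b(\boldsymbol{\psi}) &= \int_{Y} (\mathbf{y} - \nabla b(\boldsymbol{\psi}))\, p(\mathbf{y} \mid \boldsymbol{\psi})\, d\mathbf{y} \\
&= \int_{Y} \nabla p(\mathbf{y} \mid \boldsymbol{\psi})\, d\mathbf{y} = \nabla \int_{Y} p(\mathbf{y} \mid \boldsymbol{\psi})\, d\mathbf{y} = \mathbf{0}
\end{aligned}$$


and hence $E(\mathbf{y} \mid \boldsymbol{\psi}) = E(\bar{\mathbf{y}}_n \mid \boldsymbol{\psi}) = \nabla b(\boldsymbol{\psi})$. 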

For any given prior hyperparameters, $(n_0, \mathbf{y}_0)$, as the sample size $n$ becomes 
large, the weight, $\pi$, tends to one and the sample-based information dominates the 
posterior. In this context, we make an important point alluded to in our discussion of 
“objectivity and subjectivity”, in Section 4.8.2. Namely, that in the stylised setting 
of a group of individuals agreeing on an exponential family parametric form, but 
assigning different conjugate priors, a sufficiently large sample will lead to more 
or less identical posterior beliefs. Statements based on the latter might well, in 
common parlance, be claimed to be “objective”. One should always be aware, 
however, that this is no more than a conventional way of indicating a subjective 
consensus, resulting from a large amount of data processed in the light of a central 
core of shared assumptions. 
Proposition 5.7 shows that conjugate priors for exponential family parameters 
imply that posterior expectations are linear functions of the sufficient statistics. 
It is interesting to ask whether other forms of prior specification can also lead to 
linear posterior expectations or, more generally, whether knowing or constraining 
posterior moments to be of some simple algebraic form suffices to characterise 
possible families of prior distributions. These kinds of questions are considered 
in detail in, for example, Diaconis and Ylvisaker (1979) and Goel and DeGroot 
(1980). In particular, it can be shown, under some regularity conditions, that, 
for continuous exponential families, linearity of the posterior expectation does 
imply that the prior must be conjugate. 


The weighted average form of posterior mean, 


$$E[\nabla b(\boldsymbol{\psi}) \mid n_0, \mathbf{y}_0, \mathbf{y}] = \frac{n_0 \mathbf{y}_0 + n \bar{\mathbf{y}}_n}{n_0 + n},$$


obtained in Proposition 5.7, and also appearing explicitly in the prior to posterior 
updating process given in Proposition 5.6, makes clear that the prior parameter, $n_0$, 
attached to the prior mean, $\mathbf{y}_0$ for $\nabla b(\boldsymbol{\psi})$, plays an analogous role to the sample 
size, $n$, attached to the data mean $\bar{\mathbf{y}}_n$. The choice of an $n_0$ which is large relative to 
$n$ thus implies that the prior will dominate the data in determining the posterior (see, 
however, Section 5.6.3 for illustration of why a weighted-average form might not 
be desirable). Conversely, the choice of an $n_0$ which is small relative to $n$ ensures 
that the form of the posterior is essentially determined by the data. In particular, 
this suggests that a tractable analysis which “lets the data speak for themselves” can 
be obtained by letting $n_0 \to 0$. Clearly, however, this has to be regarded as simply a 
convenient approximation to the posterior that would have been obtained from the 
choice of a prior with small, but positive $n_0$. The choice $n_0 = 0$ typically implies 
a form of $p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0)$ which does not integrate to unity (a so-called improper 
density) and thus cannot be interpreted as representing an actual prior belief. The 
following example illustrates this use of limiting, improper conjugate priors in the 
context of the Bernoulli parametric model with beta conjugate prior, using standard 
rather than canonical forms for the parametric models and prior densities. 


Example 5.4. (continued). We have seen that if $r_n = x_1 + \cdots + x_n$ denotes the 
number of successes in $n$ Bernoulli trials, the conjugate beta prior density, $\mathrm{Be}(\theta \mid \alpha, \beta)$, for 
the limiting relative frequency of successes, $\theta$, leads to a $\mathrm{Be}(\theta \mid \alpha + r_n, \beta + n - r_n)$ posterior 
for $\theta$, which has expectation 


$$\frac{\alpha + r_n}{\alpha + \beta + n} = \pi \left( \frac{r_n}{n} \right) + (1 - \pi) \left( \frac{\alpha}{\alpha + \beta} \right),$$


where $\pi = (\alpha + \beta + n)^{-1} n$, providing a weighted average between the prior mean for $\theta$ and 
the frequency estimate provided by the data. In this notation, $n_0 \to 0$ corresponds to $\alpha \to 0$, 
$\beta \to 0$, which implies a $\mathrm{Be}(\theta \mid r_n, n - r_n)$ approximation to the posterior distribution, having 
expectation $r_n/n$. The limiting prior form, however, would be 


$$p(\theta \mid \alpha = 0, \beta = 0) \propto \theta^{-1} (1 - \theta)^{-1},$$


which is not a proper density. As a technique for arriving at the approximate posterior 
distribution, it is certainly convenient to make formal use of Bayes’ theorem with this 
improper form playing the role of a prior, since 


$$\begin{aligned}
p(\theta \mid \alpha = 0, \beta = 0, n, r_n) &\propto p(r_n \mid n, \theta)\, p(\theta \mid \alpha = 0, \beta = 0) \\
&\propto \theta^{r_n} (1 - \theta)^{n - r_n}\, \theta^{-1} (1 - \theta)^{-1} \\
&\propto \mathrm{Be}(\theta \mid r_n, n - r_n).
\end{aligned}$$


It is important to recognise, however, that this is merely an approximation device and 
in no way justifies regarding $p(\theta \mid \alpha = 0, \beta = 0)$ as having any special significance as a 
representation of “prior ignorance”. Clearly, any choice of $\alpha$, $\beta$ small compared with $r_n$, 
$n - r_n$ (for example, $\alpha = \beta = \frac{1}{2}$ or $\alpha = \beta = 1$ for typical values of $r_n$, $n - r_n$) will lead to 
an almost identical posterior distribution for $\theta$. 
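
The insensitivity of the posterior mean to small values of α and β can be seen numerically. The following sketch is illustrative only: the function names and the data (n = 100, r_n = 37) are assumptions, and it compares the direct and the weighted-average expressions for the posterior expectation given above.

```python
def posterior_mean(alpha, beta, n, r):
    """Mean of the Be(alpha + r, beta + n - r) posterior."""
    return (alpha + r) / (alpha + beta + n)


def weighted_form(alpha, beta, n, r):
    """The same mean written as pi * (r/n) + (1 - pi) * alpha/(alpha + beta)."""
    pi = n / (alpha + beta + n)
    return pi * (r / n) + (1 - pi) * (alpha / (alpha + beta))


n, r = 100, 37                      # hypothetical data
limit = r / n                       # mean of the limiting Be(r, n - r) form
half = posterior_mean(0.5, 0.5, n, r)
unif = posterior_mean(1.0, 1.0, n, r)

print(limit, half, unif)
print(weighted_form(0.5, 0.5, n, r), weighted_form(1.0, 1.0, n, r))
```

With these values the three posterior means differ only in the third decimal place, which is the sense in which the limiting form acts as an approximation device.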

A further problem of interpretation arises if we consider inferences for functions of $\theta$. 
Consider, for example, the choice $\alpha = \beta = 1$, which implies a uniform prior density for 
$\theta$. At an intuitive level, it might be argued that this represents “complete ignorance” about 
$\theta$, which should, presumably, entail “complete ignorance” about any function, $g(\theta)$, of $\theta$. 
However, $p(\theta)$ uniform implies that $p(g(\theta))$ is not uniform. This makes it clear that ad hoc 
intuitive notions of “ignorance”, or of what constitutes a “non-informative” prior distribution 
(in some sense), cannot be relied upon. There is a need for a more formal analysis of the 
concept and this will be given in Section 5.4, with further discussion in Section 5.6.2. 
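
As a concrete illustration of the lack of invariance, consider the transformation $\phi = g(\theta) = \theta^2$, a choice made here purely for illustration. A sketch of the change-of-variable calculation, assuming $\theta$ uniform on $(0, 1)$:

```latex
\[
p(\phi) \;=\; p(\theta)\,\left|\frac{d\theta}{d\phi}\right|
        \;=\; 1 \cdot \frac{1}{2\sqrt{\phi}}, \qquad 0 < \phi < 1 .
\]
```

The implied density for $\phi$ is far from uniform: it is unbounded near $\phi = 0$, so "ignorance" about $\theta$ expressed as uniformity does not carry over to $\theta^2$.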


Proposition 5.2 established the general forms of Bayes estimates for some 
commonly used loss functions. Proposition 5.7 provided further insight into the 
(posterior mean) form arising from quadratic loss in the case of an exponential 
family parametric model with conjugate prior. Within this latter framework, the 
following development, based closely on Gutiérrez-Peña (1992), provides further 
insight into how the posterior mode can be justified as a Bayes estimate. 

We recall, from the discussion preceding Proposition 5.6, the canonical forms 
of the $k$-parameter exponential family and its corresponding conjugate prior: 


$$p(\mathbf{y} \mid \boldsymbol{\psi}) = a(\mathbf{y}) \exp\{ \mathbf{y}'\boldsymbol{\psi} - b(\boldsymbol{\psi}) \}, \qquad \mathbf{y} \in Y,$$


and 

$$p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0) = c(n_0, \mathbf{y}_0) \exp\{ n_0 \mathbf{y}_0'\boldsymbol{\psi} - n_0 b(\boldsymbol{\psi}) \}, \qquad \boldsymbol{\psi} \in \Psi,$$


for appropriately defined $Y$, $\Psi$ and real-valued functions $a$, $b$ and $c$. 
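Consider $p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0)$ and define $d(s, \mathbf{t}) = -\log c(s, s^{-1}\mathbf{t})$, with $s > 0$ and 
$\mathbf{t} \in Y$. Further define 

$$\nabla d(s, \mathbf{t}) = \left[ \frac{\partial d(s, \mathbf{t})}{\partial t_1}, \ldots, \frac{\partial d(s, \mathbf{t})}{\partial t_k} \right]' = [d_1(s, \mathbf{t}), \ldots, d_k(s, \mathbf{t})]'$$

and $d_0(s, \mathbf{t}) = \partial d(s, \mathbf{t})/\partial s$. As a final preliminary, recall the logarithmic divergence 
measure 


$$\delta(\boldsymbol{\theta} \mid \boldsymbol{\theta}_0) = \int p(\mathbf{x} \mid \boldsymbol{\theta}) \log \frac{p(\mathbf{x} \mid \boldsymbol{\theta})}{p(\mathbf{x} \mid \boldsymbol{\theta}_0)}\, d\mathbf{x}$$


between two distributions $p(\mathbf{x} \mid \boldsymbol{\theta})$ and $p(\mathbf{x} \mid \boldsymbol{\theta}_0)$. We can now establish the following 
technical results. 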

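Proposition 5.8. (Logarithmic divergence between conjugate distributions). 
With respect to the canonical form of the $k$-parameter exponential family and 
its corresponding conjugate prior: 

(i) $\delta(\boldsymbol{\psi} \mid \boldsymbol{\psi}_0) = b(\boldsymbol{\psi}_0) - b(\boldsymbol{\psi}) + (\boldsymbol{\psi} - \boldsymbol{\psi}_0)' \nabla b(\boldsymbol{\psi})$; 


(ii) $E[\delta(\boldsymbol{\psi} \mid \boldsymbol{\psi}_0)] = d_0(n_0, n_0\mathbf{y}_0) + b(\boldsymbol{\psi}_0) + n_0^{-1} \{ k + [\nabla d(n_0, n_0\mathbf{y}_0) - \boldsymbol{\psi}_0]'\, n_0\mathbf{y}_0 \}$. 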

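Proof. From the definition of logarithmic divergence we see that 

$$\delta(\boldsymbol{\psi} \mid \boldsymbol{\psi}_0) = b(\boldsymbol{\psi}_0) - b(\boldsymbol{\psi}) + (\boldsymbol{\psi} - \boldsymbol{\psi}_0)' E_{\mathbf{y} \mid \boldsymbol{\psi}}[\mathbf{y}],$$

and (i) follows. Moreover, 

$$E[\delta(\boldsymbol{\psi} \mid \boldsymbol{\psi}_0)] = b(\boldsymbol{\psi}_0) - E[b(\boldsymbol{\psi})] + E[\boldsymbol{\psi}' \nabla b(\boldsymbol{\psi})] - \boldsymbol{\psi}_0' E[\nabla b(\boldsymbol{\psi})].$$

Differentiation of the identity 

$$\log \int \exp\{ \mathbf{t}'\boldsymbol{\psi} - s\, b(\boldsymbol{\psi}) \}\, d\boldsymbol{\psi} = d(s, \mathbf{t}),$$

with respect to $s$, establishes straightforwardly that 

$$E[b(\boldsymbol{\psi})] = -d_0(n_0, n_0\mathbf{y}_0).$$

Recalling that $E[\nabla b(\boldsymbol{\psi})] = \mathbf{y}_0$, we can write, for $i = 1, \ldots, k$, with $b_i(\boldsymbol{\psi}) = \partial b(\boldsymbol{\psi})/\partial \psi_i$, 

$$\log \int b_i(\boldsymbol{\psi}) \exp\{ \mathbf{t}'\boldsymbol{\psi} - s\, b(\boldsymbol{\psi}) \}\, d\boldsymbol{\psi} = \log t_i - \log c(s, s^{-1}\mathbf{t}) - \log s.$$


Differentiating this identity with respect to $t_i$, and interchanging the order of differentiation 
and integration, we see that 


$$\int \psi_i\, b_i(\boldsymbol{\psi})\, c(s, s^{-1}\mathbf{t}) \exp\{ \mathbf{t}'\boldsymbol{\psi} - s\, b(\boldsymbol{\psi}) \}\, d\boldsymbol{\psi} = s^{-1} [1 + d_i(s, \mathbf{t})\, t_i],$$

for $i = 1, \ldots, k$, so that 


$$E[\boldsymbol{\psi}' \nabla b(\boldsymbol{\psi})] = n_0^{-1} [k + \nabla d(n_0, n_0\mathbf{y}_0)'\, (n_0\mathbf{y}_0)]$$

and (ii) follows. 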
This result now enables us to establish easily the main result of interest. 


Proposition 5.9. (Conjugate posterior modes as Bayes estimates). 

With respect to the loss function $l(\mathbf{a}, \boldsymbol{\psi}) = \delta(\boldsymbol{\psi} \mid \mathbf{a})$, the Bayes estimate for 
$\boldsymbol{\psi}$, derived from independent observations $\mathbf{y}_1, \ldots, \mathbf{y}_n$ from the canonical 
$k$-parameter exponential family $p(\mathbf{y} \mid \boldsymbol{\psi})$ and corresponding conjugate prior 
$p(\boldsymbol{\psi} \mid n_0, \mathbf{y}_0)$, is the posterior mode, $\boldsymbol{\psi}^*$, which satisfies 


$$\nabla b(\boldsymbol{\psi}^*) = (n_0 + n)^{-1} (n_0\mathbf{y}_0 + n\bar{\mathbf{y}}_n),$$

with $\bar{\mathbf{y}}_n = n^{-1}(\mathbf{y}_1 + \cdots + \mathbf{y}_n)$. 


Proof. We note first (see the proof of Proposition 5.6) that the logarithm of 
the posterior density is given by 


$$\text{constant} + (n_0\mathbf{y}_0 + n\bar{\mathbf{y}}_n)'\boldsymbol{\psi} - (n_0 + n)\, b(\boldsymbol{\psi}),$$


from which the claimed estimating equation for the posterior mode, $\boldsymbol{\psi}^*$, is immediately 
obtained. The result now follows by noting that the same equation arises in the 
minimisation of (ii) of Proposition 5.8, with $n_0 + n$ replacing $n_0$, and $n_0\mathbf{y}_0 + n\bar{\mathbf{y}}_n$ 
replacing $n_0\mathbf{y}_0$. 
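
For the Bernoulli model in canonical form, where b(ψ) = log(1 + e^ψ), the estimating equation can be solved in closed form, since ∇b(ψ) = e^ψ/(1 + e^ψ). The sketch below is illustrative: the function names and numerical values are hypothetical, and it checks that the closed-form solution does maximise the log posterior kernel in ψ.

```python
from math import exp, log


def b(psi):
    """Canonical function b(psi) = log(1 + e^psi) for the Bernoulli model."""
    return log(1.0 + exp(psi))


def grad_b(psi):
    """Derivative of b: e^psi / (1 + e^psi)."""
    return exp(psi) / (1.0 + exp(psi))


def log_posterior_kernel(psi, n0, y0, n, ybar):
    """(n0*y0 + n*ybar) * psi - (n0 + n) * b(psi), up to an additive constant."""
    return (n0 * y0 + n * ybar) * psi - (n0 + n) * b(psi)


def posterior_mode(n0, y0, n, ybar):
    """Solve grad_b(psi) = (n0*y0 + n*ybar) / (n0 + n) for psi."""
    m = (n0 * y0 + n * ybar) / (n0 + n)
    return log(m / (1.0 - m))


n0, y0, n, ybar = 2.0, 0.5, 10, 0.3     # hypothetical values
psi_star = posterior_mode(n0, y0, n, ybar)
print(psi_star, grad_b(psi_star))
```

Here the weighted average of prior and sample means is 1/3, so the mode is the corresponding log-odds, log(1/2).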


For a recent discussion of conjugate priors for exponential families, see Con- 
sonni and Veronese (1992b). In complex problems, conjugate priors may have 
strong, unsuspected implications; for an example, see Dawid (1988a). 


5.2.3 Approximations with Conjugate Families 


Our main motivation in considering conjugate priors for exponential families has 
been to provide tractable prior to posterior (or predictive) analysis. At the same 
time, we might hope that the conjugate family for a particular parametric model 
would contain a sufficiently rich range of prior density “shapes” to enable one 
to approximate reasonably closely any particular actual prior belief function of 
interest. The next example shows that this might well not be the case. However, it 
also indicates how, with a suitable extension of the conjugate family idea, we can 
achieve both tractability and the ability to approximate closely any actual beliefs. 


Example 5.7. (The spun coin). Diaconis and Ylvisaker (1979) highlight the fact that, 
whereas a tossed coin typically generates equal long-run frequencies of heads and tails, this 
is not at all the case if a coin is spun on its edge. Experience suggests that these long-run 
frequencies often turn out for some coins to be in the ratio 2:1 or 1:2, and for other coins 
even as extreme as 1:4. In addition, some coins do appear to behave symmetrically. 

Let us consider the repeated spinning under perceived “identical conditions” of a given 
coin, about which we have no specific information beyond the general background set out 
above. Under the circumstances specified, suppose we judge the sequence of outcomes to 
be exchangeable, so that a Bernoulli parametric model, together with a prior density for the 
long-run frequency of heads, completely specifies our belief model. How might we represent 
this prior density mathematically? 

We are immediately struck by two things: first, in the light of the information given, any 
realistic prior shape will be at least bimodal, and possibly trimodal; secondly, the conjugate 
family for the Bernoulli parametric model is the beta family (see Example 5.4), which 
does not contain bimodal densities. It appears, therefore, that an insistence on tractability, 
in the sense of restricting ourselves to conjugate priors, would preclude an honest prior 
specification. 

However, we can easily generate multimodal shapes by considering mixtures of beta 
densities, 


$$p(\theta \mid \boldsymbol{\pi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \sum_{i=1}^{m} \pi_i\, \mathrm{Be}(\theta \mid \alpha_i, \beta_i),$$


with mixing weights $\pi_i > 0$, $\pi_1 + \cdots + \pi_m = 1$, attached to a selection of conjugate densities, 
$\mathrm{Be}(\theta \mid \alpha_i, \beta_i)$, $i = 1, \ldots, m$. Figure 5.2 displays the prior density resulting from the mixture 


$$0.5\, \mathrm{Be}(\theta \mid 10, 20) + 0.2\, \mathrm{Be}(\theta \mid 15, 15) + 0.3\, \mathrm{Be}(\theta \mid 20, 10),$$


which, among other things, reflects a judgement that about 20% of coins seem to behave 
symmetrically and most of the rest tend to lead to 2:1 or 1:2 ratios, with somewhat more of 
the latter than the former. 

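Suppose now that we observe $n$ outcomes $\mathbf{x} = (x_1, \ldots, x_n)$ and that these result in 
$r_n = x_1 + \cdots + x_n$ heads, so that 


$$p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{r_n} (1 - \theta)^{n - r_n}.$$

Considering the general mixture prior form 

$$p(\theta \mid \boldsymbol{\pi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \sum_{i=1}^{m} \pi_i\, \mathrm{Be}(\theta \mid \alpha_i, \beta_i),$$


we easily see from Bayes’ theorem that 

$$p(\theta \mid \boldsymbol{\pi}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \mathbf{x}) = p(\theta \mid \boldsymbol{\pi}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*),$$

where 

$$\alpha_i^* = \alpha_i + r_n, \qquad \beta_i^* = \beta_i + n - r_n$$

and 

$$\begin{aligned}
\pi_i^* &\propto \pi_i \int_0^1 \theta^{r_n} (1 - \theta)^{n - r_n}\, \mathrm{Be}(\theta \mid \alpha_i, \beta_i)\, d\theta \\
&\propto \pi_i\, \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i)\, \Gamma(\beta_i)}\, \frac{\Gamma(\alpha_i^*)\, \Gamma(\beta_i^*)}{\Gamma(\alpha_i^* + \beta_i^*)},
\end{aligned}$$

so that the resulting posterior density, 


$$p(\theta \mid \boldsymbol{\pi}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \mathbf{x}) = \sum_{i=1}^{m} \pi_i^*\, \mathrm{Be}(\theta \mid \alpha_i^*, \beta_i^*),$$


is itself a mixture of $m$ beta components. This establishes that the general mixture class of 
beta densities is closed under sampling with respect to the Bernoulli model. 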


In the case considered above, suppose that the spun coin results in 3 heads after 10 
spins and 14 heads after 50 spins. The suggested prior density corresponds to $m = 3$, 


$$\boldsymbol{\pi} = (0.5, 0.2, 0.3), \qquad \boldsymbol{\alpha} = (10, 15, 20), \qquad \boldsymbol{\beta} = (20, 15, 10).$$


[Figure 5.2: Prior and posteriors from a three-component beta mixture prior density. Curves shown: the prior, $p(\theta \mid r_n = 3, n = 10)$ and $p(\theta \mid r_n = 14, n = 50)$; horizontal axis $\theta$ from 0 to 1.] 

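Detailed calculation yields: 


for $n = 10$, $r_n = 3$: $\boldsymbol{\pi}^* = (0.77, 0.16, 0.07)$, 
$\boldsymbol{\alpha}^* = (13, 18, 23)$, $\boldsymbol{\beta}^* = (27, 22, 17)$; 
for $n = 50$, $r_n = 14$: $\boldsymbol{\pi}^* = (0.90, 0.09, 0.006)$, 
$\boldsymbol{\alpha}^* = (24, 29, 34)$, $\boldsymbol{\beta}^* = (56, 51, 46)$, 


and the resulting posterior densities are shown in Figure 5.2. 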
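
The mixture update in this example can be sketched directly from the formulae for the updated component hyperparameters and weights. The function names below are hypothetical, and the calculation works on the log scale via the standard-library `lgamma`. The printed weights come straight from the weight formula and need not match the rounded figures quoted above to the last digit.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def update_beta_mixture(weights, alphas, betas, n, r):
    """Posterior weights and hyperparameters after r heads in n spins."""
    new_alphas = [a + r for a in alphas]
    new_betas = [b + n - r for b in betas]
    # log of pi_i * B(alpha_i*, beta_i*) / B(alpha_i, beta_i)
    log_w = [
        log(w) + log_beta(a2, b2) - log_beta(a, b)
        for w, a, b, a2, b2 in zip(weights, alphas, betas, new_alphas, new_betas)
    ]
    shift = max(log_w)                      # subtract the maximum for stability
    unnorm = [exp(v - shift) for v in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm], new_alphas, new_betas


prior = ([0.5, 0.2, 0.3], [10, 15, 20], [20, 15, 10])
w10, a10, b10 = update_beta_mixture(*prior, n=10, r=3)
w50, a50, b50 = update_beta_mixture(*prior, n=50, r=14)

print([round(w, 3) for w in w10], a10, b10)
print([round(w, 3) for w in w50], a50, b50)
```

In both cases the component centred near θ = 1/3 gains weight, consistent with the observed proportions of heads (0.30 and 0.28).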
This example demonstrates that, at least in the case of the Bernoulli parametric 
model and the beta conjugate family, the use of mixtures of conjugate densities 
both maintains the tractability of the analysis and provides a great deal of flexibility 
in approximating actual forms of prior belief. In fact, the same is true for any 
exponential family model and corresponding conjugate family, as we show in the 
following. 


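Proposition 5.10. (Mixtures of conjugate priors). Let $\mathbf{x} = (x_1, \ldots, x_n)$ be 
a random sample from a regular exponential family distribution such that 


$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{j=1}^{n} f(x_j)\, [g(\boldsymbol{\theta})]^{n} \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) \left( \sum_{j=1}^{n} h_i(x_j) \right) \right\}$$


and let 


$$p(\boldsymbol{\theta} \mid \boldsymbol{\pi}, \boldsymbol{\tau}_1, \ldots, \boldsymbol{\tau}_m) = \sum_{l=1}^{m} \pi_l\, p(\boldsymbol{\theta} \mid \boldsymbol{\tau}_l),$$


where, for $l = 1, \ldots, m$, 


$$p(\boldsymbol{\theta} \mid \boldsymbol{\tau}_l) = [K(\boldsymbol{\tau}_l)]^{-1} [g(\boldsymbol{\theta})]^{\tau_{l0}} \exp\left\{ \sum_{i=1}^{k} c_i \phi_i(\boldsymbol{\theta}) \tau_{li} \right\}$$


are elements of the conjugate family. Then 


$$p(\boldsymbol{\theta} \mid \boldsymbol{\pi}, \boldsymbol{\tau}_1, \ldots, \boldsymbol{\tau}_m, \mathbf{x}) = p(\boldsymbol{\theta} \mid \boldsymbol{\pi}^*, \boldsymbol{\tau}_1^*, \ldots, \boldsymbol{\tau}_m^*) = \sum_{l=1}^{m} \pi_l^*\, p(\boldsymbol{\theta} \mid \boldsymbol{\tau}_l^*),$$


where, with $\mathbf{t}_n(\mathbf{x}) = \left\{ n, \sum_{j=1}^{n} h_1(x_j), \ldots, \sum_{j=1}^{n} h_k(x_j) \right\}$, 

$$\boldsymbol{\tau}_l^* = \boldsymbol{\tau}_l + \mathbf{t}_n(\mathbf{x}),$$

and 


$$\pi_l^* \propto \pi_l\, \frac{K(\boldsymbol{\tau}_l^*)}{K(\boldsymbol{\tau}_l)}.$$


Proof. The result follows straightforwardly from Bayes’ theorem and Proposition 5.5. 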
It is interesting to ask just how flexible mixtures of conjugate priors are. The
answer is that any prior density for an exponential family parameter can be approx- 
imated arbitrarily closely by such a mixture, as shown by Dalal and Hall (1983), 
and Diaconis and Ylvisaker (1985). However, their analyses do not provide a con- 
structive mechanism for building up such a mixture. In practice, we are left with 
having to judge when a particular tractable choice, typically a conjugate form, a 
limiting conjugate form, or a mixture of conjugate forms, is “good enough”, in the
sense that probability statements based on the resulting posterior will not differ 
radically from the statements that would have resulted from using a more honest, 
but difficult to specify or intractable, prior. 

The following result provides some guidance, in a much more general setting 
than that of conjugate mixtures, as to when an “approximate” (possibly improper) 
prior may be safely used in place of an “honest” prior. 


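Proposition 5.11. (Prior approximation). Suppose that a belief model is defined by $p(x \mid \theta)$ and $p(\theta)$, $\theta \in \Theta$, and that $q(\theta)$ is a non-negative function such that $q(x) = \int_\Theta p(x \mid \theta)\,q(\theta)\,d\theta < \infty$, where, for some $\Theta_0 \subseteq \Theta$ and $\alpha, \beta \in \Re$,

(a) $1 \le p(\theta)/q(\theta) \le 1 + \alpha$, for all $\theta \in \Theta_0$,

(b) $p(\theta)/q(\theta) \le \beta$, for all $\theta \in \Theta$.

Let $p = \int_{\Theta_0} p(\theta \mid x)\,d\theta$, $q = \int_{\Theta_0} q(\theta \mid x)\,d\theta$, and $q(\theta \mid x) = p(x \mid \theta)\,q(\theta) / \int p(x \mid \theta)\,q(\theta)\,d\theta$. Then,

(i) $(1 - p)/p \le \beta(1 - q)/q$

(ii) $q \le p(x)/q(x) \le (1 + \alpha)/p$

(iii) for all $\theta \in \Theta$, $p(\theta \mid x)/q(\theta \mid x) \le [p(\theta)/q(\theta)]/q \le \beta/q$

(iv) for all $\theta \in \Theta_0$, $p/(1 + \alpha) \le p(\theta \mid x)/q(\theta \mid x) \le (1 + \alpha)/q$

(v) for $\varepsilon = \max\{(1 - p), (1 - q)\}$ and $f : \Theta \to \Re$ such that $|f(\theta)| \le m$,

$$m^{-1} \left| \int f(\theta)\,p(\theta \mid x)\,d\theta - \int f(\theta)\,q(\theta \mid x)\,d\theta \right| \le \alpha + 3\varepsilon.$$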


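Proof. (Dickey, 1976). Part (i) clearly follows from

$$\frac{1-p}{p} = \frac{\int_{\Theta_0^c} p(\theta)\,p(x \mid \theta)\,d\theta}{\int_{\Theta_0} p(\theta)\,p(x \mid \theta)\,d\theta} \le \beta\, \frac{\int_{\Theta_0^c} q(\theta)\,p(x \mid \theta)\,d\theta}{\int_{\Theta_0} q(\theta)\,p(x \mid \theta)\,d\theta} = \beta\, \frac{1-q}{q},$$

where $\Theta_0^c = \Theta - \Theta_0$, the numerator being bounded using (b) and the denominator using (a). Clearly,

$$p(x) \ge \int_{\Theta_0} p(x \mid \theta)\,p(\theta)\,d\theta \ge \int_{\Theta_0} p(x \mid \theta)\,q(\theta)\,d\theta = q\,q(x),$$

$$q(x) \ge \int_{\Theta_0} p(x \mid \theta)\,q(\theta)\,d\theta \ge \frac{1}{1+\alpha} \int_{\Theta_0} p(x \mid \theta)\,p(\theta)\,d\theta = \frac{p}{1+\alpha}\,p(x),$$

which establishes (ii). Part (iii) follows from (b) and (ii), and part (iv) follows from (a) and (ii). Finally,

$$m^{-1} \left| \int f(\theta)\,p(\theta \mid x)\,d\theta - \int f(\theta)\,q(\theta \mid x)\,d\theta \right| \le \int \left| p(\theta \mid x) - q(\theta \mid x) \right| d\theta$$

$$\le \int_{\Theta_0} \left| p(\theta \mid x) - q(\theta \mid x) \right| d\theta + \int_{\Theta_0^c} \left| p(\theta \mid x) - q(\theta \mid x) \right| d\theta$$

$$\le \int_{\Theta_0} q(\theta \mid x) \left| \frac{p(\theta \mid x)}{q(\theta \mid x)} - 1 \right| d\theta + \int_{\Theta_0^c} p(\theta \mid x)\,d\theta + \int_{\Theta_0^c} q(\theta \mid x)\,d\theta$$

$$\le \int_{\Theta_0} q(\theta \mid x) \left( \frac{1+\alpha}{q} - 1 \right) d\theta + (1 - p) + (1 - q) \qquad \text{(by iv)}$$

$$= (1 + \alpha - q) + (1 - p) + (1 - q) \le \alpha + 3\varepsilon,$$

which proves (v).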


If, in the above, $\Theta_0$ is a subset of $\Theta$ with high probability under $q(\theta \mid x)$ and $\alpha$ is chosen to be small and $\beta$ not too large, so that $q(\theta)$ provides a good approximation to $p(\theta)$ within $\Theta_0$ and $p(\theta)$ is nowhere much greater than $q(\theta)$, then (i) implies that $\Theta_0$ has high probability under $p(\theta \mid x)$ and (ii), (iv) and (v) establish that both the respective predictive and posterior distributions, within $\Theta_0$, and also the posterior expectations of bounded functions are very close. More specifically, if $f$ is taken to be the indicator function of any subset $\Theta^* \subseteq \Theta$, (v) implies that

$$\left| \int_{\Theta^*} p(\theta \mid x)\,d\theta - \int_{\Theta^*} q(\theta \mid x)\,d\theta \right| \le \alpha + 3\varepsilon,$$

providing a bound on the inaccuracy of the posterior probability statement made using $q(\theta \mid x)$ rather than $p(\theta \mid x)$.

Proposition 5.11 therefore asserts that if a mathematically convenient alternative, $q(\theta)$, to the would-be honest prior, $p(\theta)$, can be found, giving high posterior probability to a set $\Theta_0 \subseteq \Theta$ within which it provides a good approximation to $p(\theta)$ and such that it is nowhere orders of magnitude smaller than $p(\theta)$ outside $\Theta_0$, then $q(\theta)$ may reasonably be used in place of $p(\theta)$.
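A numerical check of parts (i) and (v) of Proposition 5.11 can be sketched as follows. Everything specific here is an assumption chosen for illustration: the normal "honest" prior, the normal likelihood from 25 observations with sample mean 1, the truncation of the parameter space to a grid on $[-10, 10]$, and the choice of a constant $q(\theta) = c$ with $c$ set to the minimum of the prior over $\Theta_0$ so that condition (a) holds.

```python
from math import exp, pi, sqrt

STEP = 0.001
GRID = [-10.0 + STEP * i for i in range(20001)]   # truncated parameter space (assumption)

def prior_p(t):
    """Honest prior: normal with mean 0 and standard deviation 2 (illustrative)."""
    return exp(-t * t / 8.0) / sqrt(8.0 * pi)

def likelihood(t):
    """Likelihood of 25 unit-variance normal observations with sample mean 1."""
    return exp(-0.5 * 25 * (t - 1.0) ** 2)

def in_theta0(t):
    """Theta_0: five likelihood standard deviations either side of the sample mean."""
    return abs(t - 1.0) <= 1.0

def integral(f, keep=lambda t: True):
    """Riemann-sum integral of f over the grid points satisfying keep."""
    return STEP * sum(f(t) for t in GRID if keep(t))

# Convenient approximation q(theta) = c, with 1 <= p/q on Theta_0.
c = min(prior_p(t) for t in GRID if in_theta0(t))
alpha = max(prior_p(t) for t in GRID if in_theta0(t)) / c - 1.0   # condition (a)
beta = max(prior_p(t) for t in GRID) / c                          # condition (b)

norm_p = integral(lambda t: prior_p(t) * likelihood(t))
norm_q = integral(likelihood)                     # the constant c cancels

def post_p(t):
    return prior_p(t) * likelihood(t) / norm_p

def post_q(t):
    return likelihood(t) / norm_q

p = integral(post_p, in_theta0)
q = integral(post_q, in_theta0)
eps = max(1.0 - p, 1.0 - q)

# Part (v) with f the indicator function of {theta > 1.2}.
gap = abs(integral(post_p, lambda t: t > 1.2) - integral(post_q, lambda t: t > 1.2))
bound = alpha + 3.0 * eps
print(f"alpha={alpha:.3f} beta={beta:.3f} eps={eps:.2e} gap={gap:.4f} bound={bound:.3f}")
```

In this setting the bound is far from tight; a narrower $\Theta_0$ reduces $\alpha$ at the cost of a larger $\varepsilon$.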

In the case of $\Theta = \Re$, Figure 5.3 illustrates, in stylised form, a frequently occurring situation, where the choice $q(\theta) = c$, for some constant $c$, provides a convenient approximation. In qualitative terms, the likelihood is highly peaked relative to $p(\theta)$, which has little curvature in the region of non-negligible likelihood.

Figure 5.3 Typical conditions for precise measurement

In this situation of “precise measurement” (Savage, 1962), the choice of the function $q(\theta) = c$, for an appropriate constant $c$, clearly satisfies the conditions of Proposition 5.11 and we obtain

$$p(\theta \mid x) \approx \frac{p(x \mid \theta)\,c}{\int_\Theta p(x \mid \theta)\,c\,d\theta} = \frac{p(x \mid \theta)}{\int_\Theta p(x \mid \theta)\,d\theta},$$

the normalised likelihood function.

The second of the implementation questions posed at the end of Section 5.1.6 
concerned the possibility of avoiding the need for precise mathematical represen- 
tation of the prior density in situations where the information provided by the data 
is far greater than that implicit in the prior. The above analysis goes some way to 
answering that question; the following section provides a more detailed analysis. 


5.3 ASYMPTOTIC ANALYSIS 


In Chapter 4, we saw that in representations of belief models for observables involving a parametric model $p(x \mid \theta)$ and a prior specification $p(\theta)$, the parameter $\theta$ acquired an operational meaning as some form of strong law limit of observables. Given observations $x = (x_1, \ldots, x_n)$, the posterior distribution, $p(\theta \mid x)$, then describes beliefs about that strong law limit in the light of the information provided by $x_1, \ldots, x_n$. To answer the second question posed at the end of Section 5.1.6, we now wish to examine various properties of $p(\theta \mid x)$ as the number of observations increases; i.e., as $n \to \infty$. Intuitively, we would hope that beliefs about $\theta$ would become more and more concentrated around the “true” parameter value; i.e., the corresponding strong law limit. Under appropriate conditions, we shall see that this is, indeed, the case.


5.3.1 Discrete Asymptotics 


We begin by considering the situation where $\Theta = \{\theta_1, \theta_2, \ldots\}$ consists of a countable (possibly finite) set of values, such that the parametric model corresponding to the true parameter, $\theta_t$, is “distinguishable” from the others, in the sense that the logarithmic divergences, $\int p(x \mid \theta_t) \log[p(x \mid \theta_t)/p(x \mid \theta_i)]\,dx$, are strictly larger than zero, for all $i \ne t$.


Proposition 5.12. (Discrete asymptotics). Let $x = (x_1, \ldots, x_n)$ be observations for which a belief model is defined by the parametric model $p(x \mid \theta)$, where $\theta \in \Theta = \{\theta_1, \theta_2, \ldots\}$, and the prior $p(\theta) = \{p_1, p_2, \ldots\}$, $p_i > 0$, $\sum_i p_i = 1$. Suppose that $\theta_t \in \Theta$ is the true value of $\theta$ and that, for all $i \ne t$,

$$\int p(x \mid \theta_t) \log\left[ \frac{p(x \mid \theta_t)}{p(x \mid \theta_i)} \right] dx > 0;$$

then

$$\lim_{n \to \infty} p(\theta_t \mid x) = 1, \qquad \lim_{n \to \infty} p(\theta_i \mid x) = 0, \quad i \ne t.$$

Proof. By Bayes' theorem, and assuming that $p(x \mid \theta) = \prod_{l=1}^{n} p(x_l \mid \theta)$,

$$p(\theta_i \mid x) = p_i\, \frac{p(x \mid \theta_i)}{p(x)} = \frac{p_i \{p(x \mid \theta_i)/p(x \mid \theta_t)\}}{\sum_j p_j \{p(x \mid \theta_j)/p(x \mid \theta_t)\}} = \frac{\exp\{\log p_i + S_i\}}{\sum_j \exp\{\log p_j + S_j\}},$$

where

$$S_i = \sum_{l=1}^{n} \log \frac{p(x_l \mid \theta_i)}{p(x_l \mid \theta_t)}.$$

Conditional on $\theta_t$, the latter is the sum of $n$ independent identically distributed random quantities and hence, by the strong law of large numbers (see Section 3.2.3),

$$\lim_{n \to \infty} \frac{1}{n} S_i = \int p(x \mid \theta_t) \log\left[ \frac{p(x \mid \theta_i)}{p(x \mid \theta_t)} \right] dx.$$

The right-hand side is negative for all $i \ne t$, and equals zero for $i = t$, so that, as $n \to \infty$, $S_t \to 0$ and $S_i \to -\infty$ for $i \ne t$, which establishes the result.
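The concentration described by Proposition 5.12 is easy to observe in simulation. The following sketch assumes a Bernoulli model with three candidate parameter values, a uniform prior and a true value of 0.5; the function name `discrete_posterior`, the seed and the sample sizes are all illustrative choices. The posterior is computed from $\log p_i$ plus the log-likelihood, which differs from $\log p_i + S_i$ only by a term common to all $i$ that cancels on normalisation.

```python
import random
from math import exp, log

def discrete_posterior(data, thetas, prior):
    """Posterior over a finite set of Bernoulli parameter values."""
    r, n = sum(data), len(data)
    log_post = [
        log(p) + r * log(th) + (n - r) * log(1.0 - th)
        for th, p in zip(thetas, prior)
    ]
    shift = max(log_post)                   # guard against underflow
    unnorm = [exp(v - shift) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

random.seed(0)
thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
true_theta = 0.5
data = [1 if random.random() < true_theta else 0 for _ in range(2000)]

for n in (10, 100, 2000):
    print(n, [round(p, 4) for p in discrete_posterior(data[:n], thetas, prior)])
```

As $n$ grows, the posterior mass on the true value approaches one, at a rate governed by the logarithmic divergences between the true model and its competitors.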
An alternative way of expressing the result of Proposition 5.12, established for countable $\Theta$, is to say that the posterior distribution function for $\theta$ ultimately degenerates to a step function with a single (unit) step at $\theta = \theta_t$. In fact, this result can be shown to hold, under suitable regularity conditions, for much more general forms of $\Theta$. However, the proofs require considerable measure-theoretic machinery and the reader is referred to Berk (1966, 1970) for details.

A particularly interesting result is that if the true $\theta$ is not in $\Theta$, the posterior degenerates onto the value in $\Theta$ which gives the parametric model closest in logarithmic divergence to the true model.


5.3.2 Continuous Asymptotics 


Let us now consider what can be said in the case of general $\Theta$ about the forms of probability statements implied by $p(\theta \mid x)$ for large $n$. Proceeding heuristically for the moment, without concern for precise regularity conditions, we note that, in the case of a parametric representation for an exchangeable sequence of observables,

$$p(\theta \mid x) \propto p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \propto \exp\{\log p(\theta) + \log p(x \mid \theta)\}.$$

If we now expand the two logarithmic terms about their respective maxima, $m_0$ and $\hat\theta_n$, assumed to be determined by setting $\nabla \log p(\theta) = 0$, $\nabla \log p(x \mid \theta) = 0$, respectively, we obtain

$$\log p(\theta) = \log p(m_0) - \tfrac{1}{2}(\theta - m_0)^t H_0 (\theta - m_0) + R_0,$$

$$\log p(x \mid \theta) = \log p(x \mid \hat\theta_n) - \tfrac{1}{2}(\theta - \hat\theta_n)^t H(\hat\theta_n)(\theta - \hat\theta_n) + R_n,$$

where $R_0$, $R_n$ denote remainder terms and

$$H_0 = \left( -\frac{\partial^2 \log p(\theta)}{\partial\theta_i\,\partial\theta_j} \right)\bigg|_{\theta = m_0}, \qquad H(\hat\theta_n) = \left( -\frac{\partial^2 \log p(x \mid \theta)}{\partial\theta_i\,\partial\theta_j} \right)\bigg|_{\theta = \hat\theta_n}.$$

Assuming regularity conditions which ensure that $R_0$, $R_n$ are small for large $n$, and ignoring constants of proportionality, we see that

$$p(\theta \mid x) \propto \exp\left\{ -\tfrac{1}{2}(\theta - m_0)^t H_0 (\theta - m_0) - \tfrac{1}{2}(\theta - \hat\theta_n)^t H(\hat\theta_n)(\theta - \hat\theta_n) \right\} \propto \exp\left\{ -\tfrac{1}{2}(\theta - m_n)^t H_n (\theta - m_n) \right\},$$

with

$$H_n = H_0 + H(\hat\theta_n), \qquad m_n = H_n^{-1}\left( H_0 m_0 + H(\hat\theta_n)\hat\theta_n \right),$$

where $m_0$ (the prior mode) maximises $p(\theta)$ and $\hat\theta_n$ (the maximum likelihood estimate) maximises $p(x \mid \theta)$. The Hessian matrix, $H(\hat\theta_n)$, measures the local curvature of the log-likelihood function at its maximum, $\hat\theta_n$, and is often called the observed information matrix.

This heuristic development thus suggests that $p(\theta \mid x)$ will, for large $n$, tend to resemble a multivariate normal distribution, $N_k(\theta \mid m_n, H_n)$ (see Section 3.2.5) whose mean is a matrix weighted average of a prior (modal) estimate and an observation-based (maximum likelihood) estimate, and whose precision matrix is the sum of the prior precision matrix and the observed information matrix.

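Other approximations suggest themselves: for example, for large $n$ the prior precision will tend to be small compared with the precision provided by the data and could be ignored. Also, since, by the strong law of large numbers, for all $i, j$,

$$\lim_{n \to \infty} \left\{ \frac{1}{n} \left( -\frac{\partial^2 \log p(x \mid \theta)}{\partial\theta_i\,\partial\theta_j} \right) \right\} = \lim_{n \to \infty} \left\{ \frac{1}{n} \sum_{l=1}^{n} \left( -\frac{\partial^2 \log p(x_l \mid \theta)}{\partial\theta_i\,\partial\theta_j} \right) \right\} = \int p(x \mid \theta) \left( -\frac{\partial^2 \log p(x \mid \theta)}{\partial\theta_i\,\partial\theta_j} \right) dx,$$

we see that $H(\hat\theta_n) \approx n I(\hat\theta_n)$, where $I(\theta)$, defined by

$$(I(\theta))_{ij} = \int p(x \mid \theta) \left( -\frac{\partial^2 \log p(x \mid \theta)}{\partial\theta_i\,\partial\theta_j} \right) dx,$$

is the so-called Fisher (or expected) information matrix. We might approximate $p(\theta \mid x)$, therefore, by either $N_k(\theta \mid \hat\theta_n, H(\hat\theta_n))$ or $N_k(\theta \mid \hat\theta_n, nI(\hat\theta_n))$, where $k$ is the dimension of $\theta$.

In the case of $\theta \in \Theta \subseteq \Re$,

$$H(\hat\theta) = -\frac{\partial^2}{\partial\theta^2} \log p(x \mid \theta)\bigg|_{\theta = \hat\theta},$$

so that the approximate posterior variance is the negative reciprocal of the rate of change of the first derivative of $\log p(x \mid \theta)$ in the neighbourhood of its maximum. Sharply peaked log-likelihoods imply small posterior uncertainty and vice-versa.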

There is a large literature on the regularity conditions required to justify mathematically the heuristics presented above. Those who have contributed to the field include: Laplace (1812), Jeffreys (1939/1961, Chapter 4), LeCam (1953, 1956, 1958, 1966, 1970, 1986), Lindley (1961b), Freedman (1963b, 1965), Walker (1969), Chao (1970), Dawid (1970), DeGroot (1970, Chapter 10), Ibragimov and Hasminski (1973), Heyde and Johnstone (1979), Hartigan (1983, Chapter 4), Bermudez (1985), Chen (1985), Sweeting and Adekola (1987), Fu and Kass (1988), Fraser and McDunnough (1989), Sweeting (1992) and Ghosh et al. (1994). Related work on higher-order expansion approximations in which the normal appears as a leading term includes that of Hartigan (1965), Johnson (1967, 1970), Johnson and Ladalla (1979) and Crowder (1988). The account given below is based on Chen (1985).
In what follows, we assume that $\theta \in \Theta \subseteq \Re^k$ and that $\{p_n(\theta), n = 1, 2, \ldots\}$ is a sequence of posterior densities for $\theta$, typically of the form $p_n(\theta) = p(\theta \mid x_1, \ldots, x_n)$, derived from an exchangeable sequence with parametric model $p(x \mid \theta)$ and prior $p(\theta)$, although the mathematical development to be given does not require this. We define $L_n(\theta) = \log p_n(\theta)$, and assume throughout that, for every $n$, there is a strict local maximum, $m_n$, of $p_n$ (or, equivalently, $L_n$) satisfying:

$$L_n'(m_n) = \nabla L_n(\theta)\big|_{\theta = m_n} = 0$$

and implying the existence and positive-definiteness of

$$\Sigma_n = (-L_n''(m_n))^{-1},$$

where $[L_n''(m_n)]_{ij} = \left( \partial^2 L_n(\theta)/\partial\theta_i\,\partial\theta_j \right)\big|_{\theta = m_n}$.

Defining $|\theta| = (\theta^t\theta)^{1/2}$ and $B_\delta(\theta^*) = \{\theta \in \Theta;\ |\theta - \theta^*| < \delta\}$, we shall show that the following three basic conditions are sufficient to ensure a valid normal approximation for $p_n(\theta)$ in a small neighbourhood of $m_n$ as $n$ becomes large.

(c1) “Steepness”. $\bar\sigma_n^2 \to 0$ as $n \to \infty$, where $\bar\sigma_n^2$ is the largest eigenvalue of $\Sigma_n$.

(c2) “Smoothness”. For any $\varepsilon > 0$, there exists $N$ and $\delta > 0$ such that, for any $n > N$ and $\theta \in B_\delta(m_n)$, $L_n''(\theta)$ exists and satisfies

$$I - A(\varepsilon) \le L_n''(\theta)\{L_n''(m_n)\}^{-1} \le I + A(\varepsilon),$$

where $I$ is the $k \times k$ identity matrix and $A(\varepsilon)$ is a $k \times k$ symmetric positive-semidefinite matrix whose largest eigenvalue tends to zero as $\varepsilon \to 0$.

(c3) “Concentration”. For any $\delta > 0$, $\int_{B_\delta(m_n)} p_n(\theta)\,d\theta \to 1$ as $n \to \infty$.

Essentially, we shall see that (c1), (c2) together ensure that, for large $n$, inside a small neighbourhood of $m_n$, the function $p_n$ becomes highly peaked and behaves like the multivariate normal density kernel $\exp\{-\tfrac{1}{2}(\theta - m_n)^t \Sigma_n^{-1} (\theta - m_n)\}$. The final condition (c3) ensures that the probability outside any neighbourhood of $m_n$ becomes negligible. We do not require any assumption that the $m_n$ themselves converge, nor do we need to insist that $m_n$ be a global maximum of $p_n$. We implicitly assume, however, that the limit of $p_n(m_n)\,|\Sigma_n|^{1/2}$ exists as $n \to \infty$, and we shall now establish a bound for that limit.




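Proposition 5.13. (Bounded concentration).
The conditions (c1), (c2) imply that

$$\lim_{n \to \infty} p_n(m_n)\,|\Sigma_n|^{1/2} \le (2\pi)^{-k/2},$$

with equality if and only if (c3) holds.

Proof. Given $\varepsilon > 0$, consider $n > N$ and $\delta > 0$ as given in (c2). Then, for any $\theta \in B_\delta(m_n)$, a simple Taylor expansion establishes that

$$p_n(\theta) = p_n(m_n) \exp\{L_n(\theta) - L_n(m_n)\} = p_n(m_n) \exp\left\{ -\tfrac{1}{2}(\theta - m_n)^t (I + R_n) \Sigma_n^{-1} (\theta - m_n) \right\},$$

where

$$R_n = L_n''(\theta^+)\{L_n''(m_n)\}^{-1} - I,$$

for some $\theta^+$ lying between $\theta$ and $m_n$. It follows that

$$P_n(\delta) = \int_{B_\delta(m_n)} p_n(\theta)\,d\theta$$

is bounded above by

$$P_n^+(\delta) = p_n(m_n)\,|\Sigma_n|^{1/2}\,|I - A(\varepsilon)|^{-1/2} \int_{|z| < s_n} \exp\{-\tfrac{1}{2} z^t z\}\,dz$$

and below by

$$P_n^-(\delta) = p_n(m_n)\,|\Sigma_n|^{1/2}\,|I + A(\varepsilon)|^{-1/2} \int_{|z| < t_n} \exp\{-\tfrac{1}{2} z^t z\}\,dz,$$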

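where $s_n = \delta(1 - \underline{a}(\varepsilon))^{1/2}/\underline{\sigma}_n$ and $t_n = \delta(1 + \underline{a}(\varepsilon))^{1/2}/\bar\sigma_n$, with $\bar\sigma_n^2$ ($\underline{\sigma}_n^2$) and $\bar a(\varepsilon)$ ($\underline{a}(\varepsilon)$) the largest (smallest) eigenvalues of $\Sigma_n$ and $A(\varepsilon)$, respectively, since, for any $k \times k$ matrix $V$,

$$B_{\delta/\bar V}(0) \subseteq \{z;\ (z^t V z)^{1/2} < \delta\} \subseteq B_{\delta/\underline{V}}(0),$$

where $\bar V^2$ ($\underline{V}^2$) are the largest (smallest) eigenvalues of $V$.

Since (c1) implies that both $s_n$ and $t_n$ tend to infinity as $n \to \infty$, we have

$$|I - A(\varepsilon)|^{1/2} \lim_{n \to \infty} P_n(\delta) \le \lim_{n \to \infty} p_n(m_n)\,|\Sigma_n|^{1/2} (2\pi)^{k/2} \le |I + A(\varepsilon)|^{1/2} \lim_{n \to \infty} P_n(\delta),$$

and the required inequality follows from the fact that $|I \pm A(\varepsilon)| \to 1$ as $\varepsilon \to 0$ and $P_n(\delta) \le 1$ for all $n$. Clearly, we have equality if and only if $\lim_{n \to \infty} P_n(\delta) = 1$, which is condition (c3).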
We can now establish the main result, which may colloquially be stated as “$\theta$ has an asymptotic posterior $N_k(\theta \mid m_n, \Sigma_n^{-1})$ distribution, where $L_n'(m_n) = 0$ and $\Sigma_n^{-1} = -L_n''(m_n)$.”


Proposition 5.14. (Asymptotic posterior normality). For each $n$, consider $p_n(\cdot)$ as the density function of a random quantity $\theta_n$, and define, using the notation above, $\phi_n = \Sigma_n^{-1/2}(\theta_n - m_n)$. Then, given (c1) and (c2), (c3) is a necessary and sufficient condition for $\phi_n$ to converge in distribution to $\phi$, where $p(\phi) = (2\pi)^{-k/2} \exp\{-\tfrac{1}{2}\phi^t\phi\}$.

Proof. Given (c1) and (c2), and writing $b \ge a$, for $a, b \in \Re^k$, to denote that all components of $b - a$ are non-negative, it suffices to show that, as $n \to \infty$, $P_n(a \le \phi_n \le b) \to P(a \le \phi \le b)$ if and only if (c3) holds.

We first note that

$$P_n(a \le \phi_n \le b) = \int_{\Theta_n} p_n(\theta)\,d\theta,$$

where, by (c1), for any $\delta > 0$ and sufficiently large $n$,

$$\Theta_n = \{\theta;\ \Sigma_n^{1/2} a \le (\theta - m_n) \le \Sigma_n^{1/2} b\} \subset B_\delta(m_n).$$

It then follows, by a similar argument to that used in Proposition 5.13, that, for any $\varepsilon > 0$, $P_n(a \le \phi_n \le b)$ is bounded above by

$$p_n(m_n)\,|\Sigma_n|^{1/2}\,|I - A(\varepsilon)|^{-1/2} \int_{Z(\varepsilon)} \exp\{-\tfrac{1}{2} z^t z\}\,dz,$$

where

$$Z(\varepsilon) = \{z;\ [I - A(\varepsilon)]^{1/2} a \le z \le [I - A(\varepsilon)]^{1/2} b\},$$

and is bounded below by a similar quantity with $+A(\varepsilon)$ in place of $-A(\varepsilon)$. Given (c1), (c2), as $\varepsilon \to 0$ we have

$$\lim_{n \to \infty} P_n(a \le \phi_n \le b) = \lim_{n \to \infty} p_n(m_n)\,|\Sigma_n|^{1/2} \int_{Z(0)} \exp\{-\tfrac{1}{2} z^t z\}\,dz,$$

where $Z(0) = \{z;\ a \le z \le b\}$. The result follows from Proposition 5.13.
Conditions (c1) and (c2) are often relatively easy to check in specific applications, but (c3) may not be so directly accessible. It is useful therefore to have available alternative conditions which, given (c1), (c2), imply (c3). Two such are provided by the following:

(c4) For any $\delta > 0$, there exists an integer $N$ and $c, d \in \Re^+$ such that, for any $n > N$ and $\theta \notin B_\delta(m_n)$,

$$L_n(\theta) - L_n(m_n) < -c\{(\theta - m_n)^t \Sigma_n^{-1} (\theta - m_n)\}^d.$$

(c5) As (c4), but, with $G(\theta) = \log g(\theta)$ for some density (or normalisable positive function) $g(\theta)$ over $\Theta$,

$$L_n(\theta) - L_n(m_n) < -c\,|\Sigma_n|^{-d} + G(\theta).$$

Proposition 5.15. (Alternative conditions). Given (c1), (c2), either (c4) or (c5) implies (c3).

Proof. It is straightforward to verify that

$$\int_{\Theta - B_\delta(m_n)} p_n(\theta)\,d\theta \le p_n(m_n)\,|\Sigma_n|^{1/2} \int_{|z| > \delta/\bar\sigma_n} \exp\{-c(z^t z)^d\}\,dz,$$

given (c4), and similarly, that

$$\int_{\Theta - B_\delta(m_n)} p_n(\theta)\,d\theta \le p_n(m_n)\,|\Sigma_n|^{1/2}\,|\Sigma_n|^{-1/2} \exp\{-c\,|\Sigma_n|^{-d}\},$$

given (c5).

Since $p_n(m_n)\,|\Sigma_n|^{1/2}$ is bounded (Proposition 5.13) and the remaining terms on the right-hand side clearly tend to zero, it follows that the left-hand side tends to zero as $n \to \infty$.


To understand better the relative ease of checking (c4) or (c5) in applications, we note that, if $p_n(\theta)$ is based on data $x$,

$$L_n(\theta) = \log p(\theta) + \log p(x \mid \theta) - \log p(x),$$

so that $L_n(\theta) - L_n(m_n)$ does not involve the, often intractable, normalising constant $p(x)$. Moreover, (c4) does not even require the use of a proper prior for the vector $\theta$.

We shall illustrate the use of (c4) for the general case of canonical conjugate analysis for exponential families.
Proposition 5.16. (Asymptotic normality under conjugate analysis).
Suppose that $y_1, \ldots, y_n$ are data resulting from a random sample of size $n$ from the canonical exponential family form

$$p(y \mid \psi) = a(y) \exp\{y^t\psi - b(\psi)\}$$

with canonical conjugate prior density

$$p(\psi \mid n_0, y_0) = c(n_0, y_0) \exp\{n_0 y_0^t \psi - n_0 b(\psi)\}.$$

For each $n$, consider the posterior density

$$p_n(\psi) = p(\psi \mid n_0 + n,\ n_0 y_0 + n\bar y_n),$$

with $\bar y_n = \sum_{i=1}^{n} y_i/n$, to be the density function for a random quantity $\psi_n$, and define $\phi_n = \Sigma_n^{-1/2}(\psi_n - b'(m_n))$, where

$$(b''(m_n))_{ij} = \left( \frac{\partial^2 b(\psi)}{\partial\psi_i\,\partial\psi_j} \right)\bigg|_{\psi = m_n} = (n_0 + n)^{-1} (\Sigma_n^{-1})_{ij}.$$

Then $\phi_n$ converges in distribution to $\phi$, where

$$p(\phi) = (2\pi)^{-k/2} \exp\{-\tfrac{1}{2}\phi^t\phi\}.$$

Proof. Colloquially, we have to prove that $\psi$ has an asymptotic posterior $N_k(\psi \mid b'(m_n), \Sigma_n^{-1})$ distribution, where $b'(m_n) = (n_0 + n)^{-1}(n_0 y_0 + n\bar y_n)$ and $\Sigma_n^{-1} = (n_0 + n)\,b''(m_n)$. From a mathematical perspective,

$$p_n(\psi) \propto \exp\{(n_0 + n)\,h(\psi)\},$$

where $h(\psi) = [b'(m_n)]^t\psi - b(\psi)$, with $b(\psi)$ a continuously differentiable and strictly convex function (see Section 5.2.2). It follows that, for each $n$, $p_n(\psi)$ is unimodal with a maximum at $\psi = m_n$ satisfying $\nabla h(m_n) = 0$. By the strict concavity of $h(\cdot)$, for any $\delta > 0$ and $\psi \notin B_\delta(m_n)$, we have, for some $\psi^+$ between $\psi$ and $m_n$, with angle $\theta$ between $\psi - m_n$ and $\nabla h(\psi^+)$,

$$h(\psi) - h(m_n) = (\psi - m_n)^t \nabla h(\psi^+) = |\psi - m_n|\,|\nabla h(\psi^+)| \cos\theta < -c\,|\psi - m_n|,$$

for $c = \inf\{|\nabla h(\psi^+)|;\ \psi \notin B_\delta(m_n)\} > 0$. It follows that

$$L_n(\psi) - L_n(m_n) < -(n_0 + n)\,c\,|\psi - m_n| < -c_1 \{(\psi - m_n)^t \Sigma_n^{-1} (\psi - m_n)\}^{1/2},$$

where $c_1 = c\lambda^{-1}$, with $\lambda^2$ the largest eigenvalue of $b''(m_n)$, and hence that (c4) is satisfied. Conditions (c1), (c2) follow straightforwardly from the fact that

$$(n_0 + n)^{-1} \Sigma_n^{-1} = b''(m_n),$$

$$L_n''(\psi)\{L_n''(m_n)\}^{-1} = b''(\psi)\{b''(m_n)\}^{-1},$$

the latter not depending on $n_0 + n$, and so the result follows by Propositions 5.14 and 5.15.


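Example 5.4. (continued). Suppose that $\mathrm{Be}(\theta \mid \alpha_n, \beta_n)$, where $\alpha_n = \alpha + r_n$ and $\beta_n = \beta + n - r_n$, is the posterior derived from $n$ Bernoulli trials with $r_n$ successes and a $\mathrm{Be}(\theta \mid \alpha, \beta)$ prior. Proceeding directly,

$$L_n(\theta) = \log p_n(\theta) = \log p(x \mid \theta) + \log p(\theta) - \log p(x) = (\alpha_n - 1)\log\theta + (\beta_n - 1)\log(1 - \theta) - \log p(x),$$

so that

$$L_n'(\theta) = \frac{\alpha_n - 1}{\theta} - \frac{\beta_n - 1}{1 - \theta}$$

and

$$L_n''(\theta) = -\frac{\alpha_n - 1}{\theta^2} - \frac{\beta_n - 1}{(1 - \theta)^2}.$$

It follows that

$$m_n = \frac{\alpha_n - 1}{\alpha_n + \beta_n - 2}, \qquad (-L_n''(m_n))^{-1} = \frac{(\alpha_n - 1)(\beta_n - 1)}{(\alpha_n + \beta_n - 2)^3}.$$

Condition (c1) is clearly satisfied since $(-L_n''(m_n))^{-1} \to 0$ as $n \to \infty$; condition (c2) follows from the fact that $L_n''(\theta)$ is a continuous function of $\theta$. Finally, (c4) may be verified with an argument similar to the one used in the proof of Proposition 5.16.

Taking $\alpha = \beta = 1$ for illustration, we see that

$$m_n = \frac{r_n}{n}, \qquad (-L_n''(m_n))^{-1} = \frac{1}{n} \cdot \frac{r_n}{n} \left( 1 - \frac{r_n}{n} \right),$$

and hence that the asymptotic posterior for $\theta$ is

$$N\left( \theta \,\Big|\, \frac{r_n}{n},\ \left\{ \frac{1}{n} \cdot \frac{r_n}{n} \left( 1 - \frac{r_n}{n} \right) \right\}^{-1} \right).$$

(As an aside, we note the interesting “duality” between this asymptotic form for $\theta$ given $n$, $r_n$, and the asymptotic distribution for $r_n/n$ given $\theta$, which, by the central limit theorem, has the form

$$N\left( \frac{r_n}{n} \,\Big|\, \theta,\ \left\{ \frac{1}{n}\,\theta(1 - \theta) \right\}^{-1} \right).$$

Further reference to this kind of “duality” will be given in Appendix B.)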
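The quantities in this example are simple enough to evaluate directly. The sketch below (function names are hypothetical) computes the mode $m_n$ and the approximate variance $(-L_n''(m_n))^{-1}$, and also evaluates $p_n(m_n)\,|\Sigma_n|^{1/2}$, which by Proposition 5.13 should approach $(2\pi)^{-1/2}$ when the concentration condition holds. A uniform prior and an observed proportion of 0.3 are assumed purely for illustration.

```python
from math import exp, lgamma, log, pi, sqrt

def beta_log_density(theta, a, b):
    """Log density of Be(theta | a, b)."""
    log_norm = lgamma(a) + lgamma(b) - lgamma(a + b)
    return (a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_norm

def normal_approximation(alpha, beta, r, n):
    """Mode and approximate variance of the Be(alpha + r, beta + n - r) posterior."""
    a_n, b_n = alpha + r, beta + n - r
    mode = (a_n - 1) / (a_n + b_n - 2)
    variance = (a_n - 1) * (b_n - 1) / (a_n + b_n - 2) ** 3
    return mode, variance

def concentration(alpha, beta, r, n):
    """p_n(m_n) |Sigma_n|^(1/2), the quantity bounded in Proposition 5.13."""
    a_n, b_n = alpha + r, beta + n - r
    mode, variance = normal_approximation(alpha, beta, r, n)
    return exp(beta_log_density(mode, a_n, b_n)) * sqrt(variance)

for n in (20, 200, 2000):
    r = 3 * n // 10
    mode, variance = normal_approximation(1, 1, r, n)
    print(n, mode, variance, concentration(1, 1, r, n), (2 * pi) ** -0.5)
```

The sketch requires $0 < r_n < n$ under the uniform prior, so that the mode lies strictly inside the unit interval.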
5.3.3 Asymptotics under Transformations


The result of Proposition 5.16 is given in terms of the canonical parametrisation of the exponential family underlying the conjugate analysis. This prompts the obvious question as to whether the asymptotic posterior normality “carries over”, with appropriate transformations of the mean and covariance, to an arbitrary (one-to-one) reparametrisation of the model. More generally, we could ask the same question in relation to Proposition 5.14. A partial answer is provided by the following.


Proposition 5.17. (Asymptotic normality under transformation).

With the notation and background of Proposition 5.14, suppose that $\theta$ has an asymptotic $N_k(\theta \mid m_n, \Sigma_n^{-1})$ distribution, with the additional assumptions that, with respect to a parametric model $p(x \mid \theta_0)$, $\bar\sigma_n^2 \to 0$ and $m_n \to \theta_0$ in probability, and that given any $\delta > 0$, there is a constant $c(\delta)$ such that $P(\bar\sigma_n^2\,\underline{\sigma}_n^{-2} \le c(\delta)) \ge 1 - \delta$ for all sufficiently large $n$, where $\bar\sigma_n^2$ ($\underline{\sigma}_n^2$) is the largest (smallest) eigenvalue of $\Sigma_n$. Then, if $\nu = g(\theta)$ is a transformation such that, at $\theta = \theta_0$,

$$J_g(\theta) = \frac{\partial g(\theta)}{\partial\theta}$$

is non-singular with continuous entries, $\nu$ has an asymptotic distribution

$$N_k\left( \nu \mid g(m_n),\ [J_g(m_n)\,\Sigma_n\,J_g^t(m_n)]^{-1} \right).$$

Proof. This is a generalization and Bayesian reformulation of classical results presented in Serfling (1980, Section 3.3). For details, see Mendoza (1994).


For any finite n, the adequacy of the normal approximation provided by
Proposition 5.17 may be highly dependent on the particular transformation used. 
Anscombe (1964a, 1964b) analyses the choice of transformations which improve 
asymptotic normality. A related issue is that of selecting appropriate parametrisa- 
tions for various numerical approximation methods (Hills and Smith, 1992, 1993). 

The expression for the asymptotic posterior precision matrix (inverse covari- 
ance matrix) given in Proposition 5.17 is often rather cumbersome to work with. 
A simpler, alternative form is given by the following. 


Corollary 1. (Asymptotic precision after transformation).
In Proposition 5.17, if $H_n = \Sigma_n^{-1}$ denotes the asymptotic precision matrix for $\theta$, then the asymptotic precision matrix for $\nu = g(\theta)$ has the form

$$J_{g^{-1}}^t(g(m_n))\, H_n\, J_{g^{-1}}(g(m_n)),$$

where

$$J_{g^{-1}}(\nu) = \frac{\partial g^{-1}(\nu)}{\partial\nu}$$

is the Jacobian of the inverse transformation.

Proof. This follows immediately by reversing the roles of $\theta$ and $\nu$.
In many applications, we simply wish to consider one-to-one transformations of a single parameter. The next result provides a convenient summary of the required transformation result.


Corollary 2. (Asymptotic normality after scalar transformation).
Suppose that given the conditions of Propositions 5.14, 5.17 with scalar $\theta$, the sequence $m_n$ tends in probability to $\theta_0$ under $p(x \mid \theta_0)$, and that $[-L_n''(m_n)]^{-1} \to 0$ in probability as $n \to \infty$. Then, if $\nu = g(\theta)$ is such that $g'(\theta) = dg(\theta)/d\theta$ is continuous and non-zero at $\theta = \theta_0$, the asymptotic posterior distribution for $\nu$ is

$$N\left( \nu \mid g(m_n),\ -L_n''(m_n)\,[g'(m_n)]^{-2} \right).$$

Proof. The conditions ensure, by Proposition 5.14, that $\theta$ has an asymptotic posterior distribution of the form $N(\theta \mid m_n, -L_n''(m_n))$, so that the result follows from Proposition 5.17.


Example 5.4. (continued). Suppose, again, that $\mathrm{Be}(\theta \mid \alpha_n, \beta_n)$, where $\alpha_n = \alpha + r_n$ and $\beta_n = \beta + n - r_n$, is the posterior distribution of the parameter of a Bernoulli distribution after $n$ trials, and suppose now that we are interested in the asymptotic posterior distribution of the variance stabilising transformation (recall Example 3.3)

$$\nu = g(\theta) = 2\sin^{-1}\sqrt{\theta}.$$

Straightforward application of Corollary 2 to Proposition 5.17 leads to the asymptotic distribution

$$N\left( \nu \mid 2\sin^{-1}\left(\sqrt{r_n/n}\right),\ n \right),$$

whose mean and variance can be compared with the forms given in Example 3.3.
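The precision in this example can be checked by applying Corollary 2 numerically. The sketch below assumes the uniform-prior case $\alpha = \beta = 1$, so that $m_n = r_n/n$ and $-L_n''(m_n) = n/\{m_n(1 - m_n)\}$; the function name `arcsine_posterior` is hypothetical.

```python
from math import asin, sqrt

def arcsine_posterior(r, n):
    """Asymptotic posterior mean and precision of nu = 2 asin(sqrt(theta)).

    Assumes a uniform prior, so the posterior mode is r / n, and 0 < r < n.
    """
    m = r / n
    precision_theta = n / (m * (1.0 - m))        # -L_n''(m_n)
    g_prime = 1.0 / sqrt(m * (1.0 - m))          # derivative of 2 asin(sqrt(theta)) at m_n
    precision_nu = precision_theta / g_prime ** 2
    return 2.0 * asin(sqrt(m)), precision_nu

for r, n in ((3, 10), (14, 50)):
    print(r, n, arcsine_posterior(r, n))
```

The factor $m_n(1 - m_n)$ cancels between the two terms, which is why the resulting precision does not depend on the observed proportion.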


It is clear from the presence of the term $[g'(m_n)]^{-2}$ in the form of the asymptotic precision given in Corollary 2 to Proposition 5.17 that things will go wrong if $g'(m_n) \to 0$ as $n \to \infty$. This is dealt with in the result presented by the requirement that $g'(\theta_0) \ne 0$, where $m_n \to \theta_0$ in probability. A concrete illustration of the problems that arise when such a condition is not met is given by the following.


Example 5.8. (Non-normal asymptotic posterior). Suppose that the asymptotic posterior for a parameter $\theta \in \Re$ is given by $N(\theta \mid \bar x_n, n)$, $n\bar x_n = x_1 + \cdots + x_n$, perhaps derived from $N(x_i \mid \theta, 1)$, $i = 1, \ldots, n$, with $N(\theta \mid 0, h)$, having $h \approx 0$. Now consider the transformation $\nu = g(\theta) = \theta^2$, and suppose that the actual value of $\theta$ generating the $x_i$ through $N(x_i \mid \theta, 1)$ is $\theta = 0$.

Intuitively, it is clear that $\nu$ cannot have an asymptotic normal distribution since the sequence $\bar x_n^2$ is converging in probability to 0 through strictly positive values. Technically, $g'(0) = 0$ and the condition of the corollary is not satisfied. In fact, it can be shown that the asymptotic posterior distribution of $n\nu$ is $\chi^2$ in this case.
One attraction of the availability of the results given in Proposition 5.17 and
Corollary 1 is that verification of the conditions for asymptotic posterior normality
(as in, for example, Proposition 5.14) may be much more straightforward under one 
choice of parametrisation of the likelihood than under another. The result given 
enables us to identify the posterior normal form for any convenient choice of pa- 
rameters, subsequently deriving the form for the parameters of interest by straight- 
forward transformation. An indication of the usefulness of this result is given in 
the following example (and further applications can be found in Section 5.4). 


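Example 5.9. (Asymptotic posterior normality for a ratio). Suppose that we have a random sample $x_1, \ldots, x_n$ from the model $\{\prod_{i=1}^{n} N(x_i \mid \theta_1, 1),\ N(\theta_1 \mid 0, \lambda_1)\}$ and, independently, another random sample $y_1, \ldots, y_n$ from the model $\{\prod_{i=1}^{n} N(y_i \mid \theta_2, 1),\ N(\theta_2 \mid 0, \lambda_2)\}$, where $\lambda_1 \approx 0$, $\lambda_2 \approx 0$ and $\theta_2 \ne 0$. We are interested in the posterior distribution of $\phi_1 = \theta_1/\theta_2$ as $n \to \infty$.

First, we note that, for large $n$, it is very easily verified that the joint posterior distribution for $\theta = (\theta_1, \theta_2)$ is given by

$$N_2\left( \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \,\Bigg|\, \begin{pmatrix} \bar x_n \\ \bar y_n \end{pmatrix},\ \begin{pmatrix} n & 0 \\ 0 & n \end{pmatrix} \right),$$

where $n\bar x_n = x_1 + \cdots + x_n$, $n\bar y_n = y_1 + \cdots + y_n$. Secondly, we note that the marginal asymptotic posterior for $\phi_1$ can be obtained by defining an appropriate $\phi_2$ such that $(\theta_1, \theta_2) \to (\phi_1, \phi_2)$ is a one-to-one transformation, obtaining the distribution of $\phi = (\phi_1, \phi_2)$ using Proposition 5.17, and subsequently marginalising to $\phi_1$.

An obvious choice for $\phi_2$ is $\phi_2 = \theta_2$, so that, in the notation of Proposition 5.17, $g(\theta_1, \theta_2) = (\phi_1, \phi_2)$ and

$$J_g(\theta) = \begin{pmatrix} \partial\phi_1/\partial\theta_1 & \partial\phi_1/\partial\theta_2 \\ \partial\phi_2/\partial\theta_1 & \partial\phi_2/\partial\theta_2 \end{pmatrix} = \begin{pmatrix} \theta_2^{-1} & -\theta_1\theta_2^{-2} \\ 0 & 1 \end{pmatrix}.$$

The determinant of this, $\theta_2^{-1}$, is non-zero for $\theta_2 \ne 0$, and the conditions of Proposition 5.17 are clearly satisfied. It follows that the asymptotic posterior of $\phi$ is

$$N_2\left( \begin{pmatrix} \phi_1 \\ \phi_2 \end{pmatrix} \,\Bigg|\, \begin{pmatrix} \bar x_n/\bar y_n \\ \bar y_n \end{pmatrix},\ n \begin{pmatrix} \bar y_n^2 & \bar x_n \\ \bar x_n & 1 + (\bar x_n/\bar y_n)^2 \end{pmatrix} \right),$$

so that the required asymptotic posterior for $\phi_1 = \theta_1/\theta_2$ is

$$N\left( \phi_1 \,\Big|\, \frac{\bar x_n}{\bar y_n},\ n\,\bar y_n^2 \left[ 1 + \left( \frac{\bar x_n}{\bar y_n} \right)^2 \right]^{-1} \right).$$

Any reader remaining unappreciative of the simplicity of the above analysis may care to examine the form of the likelihood function, etc., corresponding to an initial parametrisation directly in terms of $\phi_1$, $\phi_2$, and to contemplate verifying directly the conditions of Proposition 5.14 using the $\phi_1$, $\phi_2$ parametrisation.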
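The matrix computation behind the ratio example can be sketched with Corollary 1, using the Jacobian of the inverse map $(\phi_1, \phi_2) \to (\theta_1, \theta_2) = (\phi_1\phi_2, \phi_2)$. The function name `ratio_posterior` and the numerical inputs are hypothetical; the marginal precision of $\phi_1$ is obtained from the joint precision matrix $P$ by the standard Schur-complement formula $P_{11} - P_{12}^2/P_{22}$.

```python
def ratio_posterior(x_bar, y_bar, n):
    """Asymptotic posterior mean and precision for phi = (theta1/theta2, theta2).

    Assumes the joint posterior of (theta1, theta2) is normal with mean
    (x_bar, y_bar) and precision matrix n * I, and that y_bar is non-zero.
    """
    phi1, phi2 = x_bar / y_bar, y_bar
    # Jacobian of the inverse transformation, evaluated at (phi1, phi2).
    jac = [[phi2, phi1],
           [0.0, 1.0]]
    # Precision of phi: J^t (n I) J.
    precision = [
        [n * sum(jac[k][i] * jac[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]
    # Marginal precision of phi1 (Schur complement of the phi2 entry).
    marginal = precision[0][0] - precision[0][1] ** 2 / precision[1][1]
    return (phi1, phi2), precision, marginal

mean, precision, marginal = ratio_posterior(1.0, 2.0, 100)
print(mean, precision, marginal)
```

The marginal precision agrees with $n\,\bar y_n^2/[1 + (\bar x_n/\bar y_n)^2]$, which becomes small when $\bar y_n$ is near zero, reflecting the instability of the ratio in that region.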
5.4 REFERENCE ANALYSIS


In the previous section, we have examined situations where data corresponding 
to large sample sizes come to dominate prior information, leading to inferences 
which are negligibly dependent on the initial state of information. The third of the 
questions posed at the end of Section 5.1.6 relates to specifying prior distributions 
in situations where it is felt that, even for moderate sample sizes, the data should be 
expected to dominate prior information because of the “vague” nature of the latter. 

However, the problem of characterising a “non-informative” or “objective” 
prior distribution, representing “prior ignorance”, “vague prior knowledge” and 
“letting the data speak for themselves” is far more complex than the apparent 
intuitive immediacy of these words and phrases would suggest. 

In Section 5.6.2, we shall provide a brief review of the fascinating history of 
the quest for this “baseline”, limiting prior form. However, it is as well to make 
clear straightaway our own view — very much in the operationalist spirit with which 
we began our discussion of uncertainty in Chapter 2—that “mere words” are an 
inadequate basis for clarifying such a slippery concept. Put bluntly: data cannot 
ever speak entirely for themselves; every prior specification has some informative 
posterior or predictive implications; and “vague” is itself much too vague an idea 
to be useful. There is no “objective” prior that represents ignorance. 

On the other hand, we recognise that there is often a pragmatically important 
need for a form of prior to posterior analysis capturing, in some well-defined sense, 
the notion of the prior having a minimal effect, relative to the data, on the final 
inference. Such a reference analysis might be required as an approximation to 
actual individual beliefs; more typically, it might be required as a limiting “what
if?” baseline in considering a range of prior to posterior analyses, or as a default 
option when there are insufficient resources for detailed elicitation of actual prior 
knowledge. 

In line with the unified perspective we have tried to adopt throughout this vol- 
ume, the setting for our development of such a reference analysis will be the gen- 
eral decision-theoretic framework, together with the specific information-theoretic 
tools that have emerged in earlier chapters as key measures of the discrepancies (or 
“distances”) between belief distributions. From the approach we adopt, it will be 
clear that the reference prior component of the analysis is simply a mathematical 
tool. It has considerable pragmatic importance in implementing a reference anal- 
ysis, whose role and character will be precisely defined, but it is not a privileged, 
“uniquely non-informative” or “objective” prior. Its main use will be to provide 
a “conventional” prior, to be used when a default specification having a claim to 
being non-influential in the sense described above is required. We seek to move 
away, therefore, from the rather philosophically muddled debates about “prior ig- 
norance” that have all too often confused these issues, and towards well-defined 
decision-theoretic and information-theoretic procedures. 
5.4.1. Reference Decisions


Consider a specific form of decision problem with possible decisions $d \in D$
providing possible answers, $a \in A$, to an inference problem, with unknown
state of the world $\omega = (\omega_1, \omega_2)$, utilities for consequences $(a, \omega)$ given by
$u(d(\omega_1)) = u(a, \omega_1)$ and the availability of an experiment $e$ which consists of
obtaining an observation $x$ having parametric model $p(x \mid \omega_2)$ and a prior
probability density $p(\omega) = p(\omega_1 \mid \omega_2)\,p(\omega_2)$ for the unknown state of the world, $\omega$. This
general structure describes a situation where practical consequences depend
directly on the $\omega_1$ component of $\omega$, whereas inference from data $x \in X$ provided by
experiment $e$ takes place indirectly, through the $\omega_2$ component of $\omega$ as described
by $p(\omega_1 \mid \omega_2)$. If $\omega_1$ is a function of $\omega_2$, the prior density is, of course, simply
$p(\omega_2)$.

To avoid subscript proliferation, let us now, without any risk of confusion,
indulge in a harmless abuse of notation by writing $\omega_1 = \omega$, $\omega_2 = \theta$. This both
simplifies the exposition and has the mnemonic value of suggesting that $\omega$ is the
state of the world of ultimate interest (since it occurs in the utility function), whereas
$\theta$ is a parameter in the usual sense (since it occurs in the probability model). Often
$\omega$ is just some function $\omega = \phi(\theta)$ of $\theta$; if $\omega$ is not a function of $\theta$, the relationship
between $\omega$ and $\theta$ is that described in their joint distribution $p(\omega, \theta) = p(\omega \mid \theta)\,p(\theta)$.

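Now, for given conditional prior $p(\omega \mid \theta)$ and utility function $u(a, \omega)$, let us
examine, in utility terms, the influence of the prior $p(\theta)$, relative to the observational
information provided by $e$. We note that if $a_0^*$ denotes the optimal answer under $p(\omega)$
and $a_x^*$ denotes the optimal answer under $p(\omega \mid x)$, then, using Definition 3.13 (ii),
with appropriate notational changes, and noting that

\[ \int p(x) \int p(\omega \mid x)\, u(a_0^*, \omega)\, d\omega\, dx = \int p(\omega)\, u(a_0^*, \omega)\, d\omega, \]

the expected (utility) value of the experiment $e$, given the prior $p(\theta)$, is

\[ v_u\{e, p(\theta)\} = \int p(x) \int p(\omega \mid x)\, u(a_x^*, \omega)\, d\omega\, dx - \int p(\omega)\, u(a_0^*, \omega)\, d\omega, \]

where, assuming $\omega$ is independent of $x$, given $\theta$,

\[ p(\omega) = \int p(\omega \mid \theta)\, p(\theta)\, d\theta, \qquad p(\omega \mid x) = \int p(\omega \mid \theta)\, \frac{p(x \mid \theta)\, p(\theta)}{p(x)}\, d\theta \]

and

\[ p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta. \]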


If $e(k)$ denotes the experiment consisting of $k$ independent replications of $e$, that
is yielding observations $\{x_1, \ldots, x_k\}$ with joint parametric model $\prod_{i=1}^{k} p(x_i \mid \theta)$,
then $v_u\{e(k), p(\theta)\}$, the expected utility value of the experiment $e(k)$, has the
same mathematical form as $v_u\{e, p(\theta)\}$, but with $\boldsymbol{x} = (x_1, \ldots, x_k)$ and
$p(\boldsymbol{x} \mid \theta) = \prod_{i=1}^{k} p(x_i \mid \theta)$. Intuitively, at least in suitably regular cases, as $k \to \infty$ we obtain,
from $e(\infty)$, perfect (i.e., complete) information about $\theta$, so that, assuming the limit
to exist,

\[ v_u\{e(\infty), p(\theta)\} = \lim_{k \to \infty} v_u\{e(k), p(\theta)\} \]

is the expected (utility) value of perfect information, about $\theta$, given $p(\theta)$.

Clearly, the more valuable the information contained in $p(\theta)$, the less will be
the expected value of perfect information about $\theta$; conversely, the less valuable
the information contained in the prior, the more we would expect to gain from
exhaustive experimentation. This, then, suggests a well-defined “thought experiment”
procedure for characterising a “minimally valuable prior”: choose, from
the class of priors which has been identified as compatible with other assumptions
about $(\omega, \theta)$, that prior, $\pi(\theta)$, say, which maximises the expected value of perfect
information about $\theta$. Such a prior will be called a $u$-reference prior; the posterior
distributions,

\[ \pi(\omega \mid x) = \int p(\omega \mid \theta)\, \pi(\theta \mid x)\, d\theta, \qquad \pi(\theta \mid x) \propto p(x \mid \theta)\, \pi(\theta), \]

derived from combining $\pi(\theta)$ with actual data $x$, will be called $u$-reference
posteriors; and the optimal decision derived from $\pi(\omega \mid x)$ and $u(a, \omega)$ will be called a
$u$-reference decision.

It is important to note that the limit above is not taken in order to obtain
some form of asymptotic “approximation” to reference distributions; the “exact”
reference prior is defined as that which maximises the value of perfect information
about $\theta$, not as that which maximises the expected value of the experiment.


Example 5.10. (Prediction with quadratic loss). Suppose that beliefs about a
sequence of observables, $\boldsymbol{x} = (x_1, \ldots, x_n)$, correspond to assuming the latter to be a random
sample from an $N(x \mid \mu, \lambda)$ parametric model, with known precision $\lambda$, together with a prior
for $\mu$ to be selected from the class $\{N(\mu \mid \mu_0, \lambda_0),\ \mu_0 \in \Re,\ \lambda_0 \ge 0\}$. Assuming a quadratic
loss function, the decision problem is to provide a point estimate for $x_{n+1}$, given $x_1, \ldots, x_n$.
We shall derive a reference analysis of this problem, for which $A = \Re$, $\omega = x_{n+1}$, and
$\theta = \mu$. Moreover,

\[ u(a, \omega) = -(a - x_{n+1})^2, \qquad p(\boldsymbol{x} \mid \theta) = \prod_{i=1}^{n} N(x_i \mid \mu, \lambda) \]

and, for given $\mu_0$, $\lambda_0$, we have

\[ p(\omega, \theta) = p(x_{n+1}, \mu) = p(x_{n+1} \mid \mu)\, p(\mu) = N(x_{n+1} \mid \mu, \lambda)\, N(\mu \mid \mu_0, \lambda_0). \]

For the purposes of the “thought experiment”, let $z_k = (x_1, \ldots, x_{kn})$ denote the (imagined)
outcomes of $k$ replications of the experiment yielding the observables $(x_1, \ldots, x_{kn})$, say,
and let us denote the future observation to be predicted ($x_{kn+1}$) simply by $x$. Then

\[ v_u\{e(k), N(\mu \mid \mu_0, \lambda_0)\} = -\int p(z_k) \inf_a \int p(x \mid z_k)(a - x)^2\, dx\, dz_k + \inf_a \int p(x)(a - x)^2\, dx. \]

However, we know from Proposition 5.3 that optimal estimates with respect to quadratic
loss functions are given by the appropriate means, so that

\[ v_u\{e(k), N(\mu \mid \mu_0, \lambda_0)\} = -\int p(z_k)\, V[x \mid z_k]\, dz_k + V[x] = -V[x \mid z_k] + V[x], \]

since, by virtue of the normal distributional assumptions, the predictive variance of $x$ given
$z_k$ does not depend explicitly on $z_k$. In fact, straightforward manipulations reveal that

\[ v_u\{e(\infty), N(\mu \mid \mu_0, \lambda_0)\} = \lim_{k \to \infty} v_u\{e(k), N(\mu \mid \mu_0, \lambda_0)\}
 = \lim_{k \to \infty} \left\{ -\left[\lambda^{-1} + (\lambda_0 + kn\lambda)^{-1}\right] + \left(\lambda^{-1} + \lambda_0^{-1}\right) \right\} = \lambda_0^{-1}, \]

so that the $u$-reference prior corresponds to the choice $\lambda_0 = 0$, with $\mu_0$ arbitrary.
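The limiting argument of Example 5.10 can be checked numerically. The following is a minimal sketch, not part of the original derivation: it evaluates the difference between the prior and posterior predictive variances given in the example, and the function names and numerical values are purely illustrative assumptions.

```python
def value_of_experiment(k, n, lam, lam0):
    """Expected utility value v_u{e(k), N(mu | mu0, lam0)} under quadratic loss.

    This is the prior predictive variance minus the posterior predictive
    variance, using the expressions quoted in Example 5.10.
    """
    prior_pred_var = 1.0 / lam + 1.0 / lam0
    post_pred_var = 1.0 / lam + 1.0 / (lam0 + k * n * lam)
    return prior_pred_var - post_pred_var


def value_of_perfect_information(lam0):
    """Limit of value_of_experiment as k -> infinity, namely 1 / lam0."""
    return 1.0 / lam0


if __name__ == "__main__":
    n, lam = 5, 2.0  # illustrative sample size and known precision
    for lam0 in (4.0, 1.0, 0.25):
        finite_k = [value_of_experiment(k, n, lam, lam0) for k in (1, 10, 1000)]
        print(lam0, finite_k, value_of_perfect_information(lam0))
```

As the prior precision `lam0` decreases, the value of perfect information grows without bound, which is the sense in which the choice $\lambda_0 = 0$ is the maximising one.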


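Example 5.11. (Variance estimation). Suppose that beliefs about $\boldsymbol{x} = \{x_1, \ldots, x_n\}$
correspond to assuming $\boldsymbol{x}$ to be a random sample from $N(x \mid 0, \lambda)$ together with a gamma
prior for $\lambda$ centred on $\lambda_0$, so that $p(\lambda) = \mathrm{Ga}(\lambda \mid \alpha, \alpha\lambda_0^{-1})$, $\alpha > 0$. The decision problem is
to provide a point estimate for $\sigma^2 = \lambda^{-1}$, assuming a standardised quadratic loss function,
so that

\[ u(a, \sigma^2) = -\left[\frac{a - \sigma^2}{\sigma^2}\right]^2 = -(a\lambda - 1)^2. \]

Thus, we have $A = \Re^+$, $\theta = \lambda$, $\omega = \sigma^2$, and

\[ p(\boldsymbol{x}, \lambda) = \prod_{i=1}^{n} N(x_i \mid 0, \lambda)\, \mathrm{Ga}(\lambda \mid \alpha, \alpha\lambda_0^{-1}). \]

Let $z_k = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k\}$ denote the outcome of $k$ replications of the experiment. Then

\[ v_u\{e(k), p(\lambda)\} = -\int p(z_k) \inf_a \int p(\lambda \mid z_k)(a\lambda - 1)^2\, d\lambda\, dz_k + \inf_a \int p(\lambda)(a\lambda - 1)^2\, d\lambda, \]

where

\[ p(\lambda) = \mathrm{Ga}(\lambda \mid \alpha, \alpha\lambda_0^{-1}), \qquad p(\lambda \mid z_k) = \mathrm{Ga}\!\left(\lambda \,\Big|\, \alpha + \frac{kn}{2},\ \alpha\lambda_0^{-1} + \frac{kns^2}{2}\right), \]

and $kns^2 = \sum_i \sum_j x_{ij}^2$. Since

\[ \inf_a \int \mathrm{Ga}(\lambda \mid \alpha, \beta)(a\lambda - 1)^2\, d\lambda = \frac{1}{1 + \alpha} \]

and this is attained when $a = \beta/(\alpha + 1)$, one has

\[ v_u\{e(\infty), p(\lambda)\} = \lim_{k \to \infty} v_u\{e(k), p(\lambda)\}
 = \lim_{k \to \infty} \left\{ -\frac{1}{1 + \alpha + (kn)/2} + \frac{1}{1 + \alpha} \right\} = \frac{1}{1 + \alpha}. \]

This is maximised when $\alpha = 0$ and, hence, the $u$-reference prior corresponds to the choice
$\alpha = 0$, with $\lambda_0$ arbitrary. Given actual data, $\boldsymbol{x} = (x_1, \ldots, x_n)$, the $u$-reference posterior for
$\lambda$ is $\mathrm{Ga}(\lambda \mid n/2, ns^2/2)$, where $ns^2 = \sum_i x_i^2$ and, thus, the $u$-reference decision is to give
the estimate

\[ \hat{\sigma}^2 = \frac{ns^2/2}{(n/2) + 1} = \frac{\sum_i x_i^2}{n + 2}. \]

Hence, the reference estimator of $\sigma^2$ with respect to standardised quadratic loss is not the
usual $s^2$, but a slightly smaller multiple of $s^2$.

It is of interest to note that, from a frequentist perspective, $\hat{\sigma}^2$ is the best invariant
estimator of $\sigma^2$ and is admissible. Indeed, $\hat{\sigma}^2$ dominates $s^2$ or any smaller multiple of $s^2$ in
terms of frequentist risk (cf. Example 45 in Berger, 1985a, Chapter 4). Thus, the $u$-reference
approach has led to the “correct” multiple of $s^2$ as seen from a frequentist perspective.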
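The quantities appearing in Example 5.11 are simple enough to verify directly. The sketch below is an illustration under stated assumptions rather than part of the original text: it uses the standard gamma moments $E[\lambda] = \alpha/\beta$ and $E[\lambda^2] = \alpha(\alpha+1)/\beta^2$, and the fact that $\sum_i x_i^2/\sigma^2$ has a $\chi^2_n$ distribution with mean $n$ and second moment $n^2 + 2n$. All function names are hypothetical.

```python
def optimal_gamma_answer(alpha, beta):
    """Minimise E[(a*lam - 1)^2] when lam ~ Ga(alpha, beta), beta being a rate.

    Returns (optimal a, minimal expected loss).
    """
    m1 = alpha / beta                        # E[lam]
    m2 = alpha * (alpha + 1.0) / beta ** 2   # E[lam^2]
    a_star = m1 / m2                         # equals beta / (alpha + 1)
    min_loss = 1.0 - m1 ** 2 / m2            # equals 1 / (alpha + 1)
    return a_star, min_loss


def reference_variance_estimate(xs):
    """u-reference estimate of sigma^2: sum(x_i^2) / (n + 2)."""
    return sum(x * x for x in xs) / (len(xs) + 2)


def frequentist_risk(c, n):
    """E[(c*S/sigma^2 - 1)^2] for S = sum(x_i^2), where S/sigma^2 ~ chi^2_n."""
    return c * c * (n * n + 2 * n) - 2 * c * n + 1.0


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(reference_variance_estimate(data))
    # Risk of the multiple 1/(n+2) compared with that of the multiple 1/n.
    print(frequentist_risk(1 / 12, 10), frequentist_risk(1 / 10, 10))
```

Applying `optimal_gamma_answer` to the reference posterior parameters $(n/2, ns^2/2)$ reproduces `reference_variance_estimate`, which is the content of the final display in the example.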


Explicit reference decision analysis is possible when the parameter space
$\Theta = \{\theta_1, \ldots, \theta_M\}$ is finite. In this case, the expected value of perfect information
(cf. Definition 2.19) may be written as

\[ v_u\{e(\infty), p(\theta)\} = \sum_{i=1}^{M} p(\theta_i) \sup_{d \in D} u(d(\theta_i)) - \sup_{d \in D} \sum_{i=1}^{M} p(\theta_i)\, u(d(\theta_i)), \]

and the $u$-reference prior, which is that $\pi(\theta)$ which maximises $v_u\{e(\infty), p(\theta)\}$, may
be explicitly obtained by standard algebraic manipulations. For further information,
see Bernardo (1981a) and Rabena (1998).
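For a finite parameter space the expected value of perfect information is a finite sum and can be evaluated directly. The following minimal sketch is an illustration only: it treats the case $M = 2$, represents the utilities as a hypothetical table `utility[d][i]` $= u(d(\theta_i))$, and replaces the algebraic manipulations by a crude grid search over $p(\theta_1)$.

```python
def expected_value_of_perfect_information(prior, utility):
    """prior[i] = p(theta_i); utility[d][i] = u(d(theta_i)) for decision d."""
    with_perfect_info = sum(
        p * max(row[i] for row in utility) for i, p in enumerate(prior)
    )
    without_info = max(
        sum(p * u for p, u in zip(prior, row)) for row in utility
    )
    return with_perfect_info - without_info


def u_reference_prior_two_states(utility, grid_size=1001):
    """Grid search over p(theta_1) in [0, 1] for the case M = 2."""
    best_p, best_value = 0.0, float("-inf")
    for j in range(grid_size):
        p1 = j / (grid_size - 1)
        value = expected_value_of_perfect_information([p1, 1.0 - p1], utility)
        if value > best_value:
            best_p, best_value = p1, value
    return best_p, best_value


if __name__ == "__main__":
    zero_one = [[1.0, 0.0], [0.0, 1.0]]  # utility 1 for guessing the true state
    print(u_reference_prior_two_states(zero_one))
```

With the symmetric zero-one utility used above the search selects $p(\theta_1) = 1/2$; with asymmetric utilities the maximising prior is in general no longer uniform, which illustrates the dependence of a $u$-reference prior on the utility function.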


5.4.2 One-dimensional Reference Distributions 


In Sections 2.7 and 3.4, we noted that reporting beliefs is itself a decision problem, 
where the “inference answer” space consists of the class of possible belief distribu- 
tions that could be reported about the quantity of interest, and the utility function is 
a proper scoring rule which—in pure inference problems — may be identified with 
the logarithmic scoring rule. 
Our development of reference analysis from now on will concentrate on this
case, for which we simply denote $v_u\{\cdot\}$ by $v\{\cdot\}$, and replace the term “$u$-reference”
by “reference”.

In discussing reference decisions, we have considered a rather general utility
structure where practical interest centred on a quantity $\omega$ related to the $\theta$ of an
experiment by a conditional probability specification, $p(\omega \mid \theta)$. Here, we shall
consider the case where the quantity of interest is $\theta$ itself, with $\theta \in \Theta \subseteq \Re$. More
general cases will be considered later.

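If an experiment $e$ consists of an observation $x \in X$ having parametric model
$p(x \mid \theta)$, with $\omega = \theta$, $A = \{q(\cdot);\ q(\theta) > 0,\ \int_\Theta q(\theta)\, d\theta = 1\}$ and the utility function
is the logarithmic scoring rule

\[ u\{q(\cdot), \theta\} = A \log q(\theta) + B(\theta), \]

the expected utility value of the experiment $e$, given the prior density $p(\theta)$, is

\[ v\{e, p(\theta)\} = \int p(x) \int u\{q_x(\cdot), \theta\}\, p(\theta \mid x)\, d\theta\, dx - \int u\{q_0(\cdot), \theta\}\, p(\theta)\, d\theta, \]

where $q_0(\cdot)$, $q_x(\cdot)$ denote the optimal choices of $q(\cdot)$ with respect to $p(\theta)$ and
$p(\theta \mid x)$, respectively. Noting that $u$ is a proper scoring rule, so that, for any
$p(\theta)$,

\[ \sup_q \int u\{q(\cdot), \theta\}\, p(\theta)\, d\theta = \int u\{p(\cdot), \theta\}\, p(\theta)\, d\theta, \]

it is easily seen that

\[ v\{e, p(\theta)\} \propto \int p(x) \int p(\theta \mid x) \log \frac{p(\theta \mid x)}{p(\theta)}\, d\theta\, dx = I\{e, p(\theta)\}, \]

the amount of information about $\theta$ which $e$ may be expected to provide.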
The corresponding expected information from the (hypothetical) experiment
$e(k)$ yielding the (imagined) observation $z_k = (x_1, \ldots, x_k)$ with parametric model

\[ p(z_k \mid \theta) = \prod_{i=1}^{k} p(x_i \mid \theta) \]

is given by

\[ I\{e(k), p(\theta)\} = \int p(z_k) \int p(\theta \mid z_k) \log \frac{p(\theta \mid z_k)}{p(\theta)}\, d\theta\, dz_k, \]

and so the expected (utility) value of perfect information about $\theta$ is

\[ I\{e(\infty), p(\theta)\} = \lim_{k \to \infty} I\{e(k), p(\theta)\}, \]

provided that this limit exists. This quantity measures the missing information
about $\theta$ as a function of the prior $p(\theta)$.

The reference prior for $\theta$, denoted by $\pi(\theta)$, is thus defined to be that prior
which maximises the missing information functional. Given actual data $x$, the
reference posterior $\pi(\theta \mid x)$ to be reported is simply derived from Bayes’ theorem,
as $\pi(\theta \mid x) \propto p(x \mid \theta)\, \pi(\theta)$.

Unfortunately, $\lim_{k \to \infty} I\{e(k), p(\theta)\}$ is typically infinite (unless $\theta$ can only
take a finite range of values) and a direct approach to deriving $\pi(\theta)$ along these
lines cannot be implemented. However, a natural way of overcoming this technical
difficulty is available: we derive the sequence of priors $\pi_k(\theta)$ which maximise
$I\{e(k), p(\theta)\}$, $k = 1, 2, \ldots$, and subsequently take $\pi(\theta)$ to be a suitable limit. This
approach will now be developed in detail.

Let $e$ be the experiment which consists of one observation $x$ from $p(x \mid \theta)$,
$\theta \in \Theta \subseteq \Re$. Suppose that we are interested in reporting inferences about $\theta$ and that
no restrictions are imposed on the form of the prior distribution $p(\theta)$. It is easily
verified that the amount of information about $\theta$ which $k$ independent replications
of $e$ may be expected to provide may be rewritten as

\[ I^{\theta}\{e(k), p(\theta)\} = \int p(\theta) \log \frac{f_k(\theta)}{p(\theta)}\, d\theta, \]

where

\[ f_k(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p(\theta \mid z_k)\, dz_k \right\} \]

and $z_k = \{x_1, \ldots, x_k\}$ is a possible outcome from $e(k)$, so that

\[ p(\theta \mid z_k) \propto \prod_{i=1}^{k} p(x_i \mid \theta)\, p(\theta) \]

is the posterior distribution for $\theta$ after $z_k$ has been observed. Moreover, for any
prior $p(\theta)$ one must have the constraint $\int p(\theta)\, d\theta = 1$ and, therefore, the prior
$\pi_k(\theta)$ which maximises $I^{\theta}\{e(k), p(\theta)\}$ must be an extremal of the functional

\[ F\{p(\cdot)\} = \int p(\theta) \log \frac{f_k(\theta)}{p(\theta)}\, d\theta + \lambda \left\{ \int p(\theta)\, d\theta - 1 \right\}. \]

Since this is of the form $F\{p(\cdot)\} = \int g\{p(\cdot)\}\, d\theta$, where, as a functional of $p(\cdot)$,
$g$ is twice continuously differentiable, any function $p(\cdot)$ which maximises $F$ must
satisfy the condition

\[ \frac{\partial}{\partial \varepsilon} F\{p(\cdot) + \varepsilon \tau(\cdot)\} \Big|_{\varepsilon = 0} = 0, \quad \text{for all } \tau. \]
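It follows that, for any function $\tau$,

\[ \int \left\{ \tau(\theta) \log f_k(\theta) + \frac{p(\theta)}{f_k(\theta)}\, f_k'(\theta) - \tau(\theta)\left(1 + \log p(\theta)\right) + \tau(\theta)\lambda \right\} d\theta = 0, \]

where, after some algebra,

\[ f_k'(\theta) = \frac{\partial}{\partial \varepsilon} \left[ \exp\left\{ \int p(z_k \mid \theta) \log \frac{p(z_k \mid \theta)\{p(\theta) + \varepsilon\tau(\theta)\}}{\int p(z_k \mid \theta)\{p(\theta) + \varepsilon\tau(\theta)\}\, d\theta}\, dz_k \right\} \right]_{\varepsilon = 0}
 = f_k(\theta)\, \frac{\tau(\theta)}{p(\theta)}. \]

Thus, the required condition becomes

\[ \int \tau(\theta) \left\{ \log f_k(\theta) - \log p(\theta) + \lambda \right\} d\theta = 0, \quad \text{for all } \tau(\theta), \]

which implies that the desired extremal should satisfy, for all $\theta \in \Theta$,

\[ \log f_k(\theta) - \log p(\theta) + \lambda = 0 \]

and hence that $p(\theta) \propto f_k(\theta)$.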

Note that, for each $k$, this only provides an implicit solution for the prior
which maximises $I^{\theta}\{e(k), p(\theta)\}$, since $f_k(\theta)$ depends on the prior through the
posterior distribution $p(\theta \mid z_k) = p(\theta \mid x_1, \ldots, x_k)$. However, for large values of
$k$, an approximation, $p^*(\theta \mid z_k)$, say, may be found to the posterior distribution of
$\theta$, which is independent of the prior $p(\theta)$. It follows that, under suitable regularity
conditions, the sequence of positive functions

\[ f_k^*(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p^*(\theta \mid z_k)\, dz_k \right\} \]

will induce, by formal use of Bayes’ theorem, a sequence of posterior distributions

\[ \pi_k(\theta \mid x) \propto p(x \mid \theta)\, f_k^*(\theta) \]

with the same limiting distributions that would have been obtained from the
sequence of posteriors derived from the sequence of priors $\pi_k(\theta)$ which maximise
$I^{\theta}\{e(k), p(\theta)\}$. This completes our motivation for Definition 5.7. For further
information see Bernardo (1979b) and ensuing discussion.
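The implicit condition $p(\theta) \propto f_k(\theta)$ suggests a fixed-point iteration: starting from some prior, compute $f_k(\theta)$, renormalise, and repeat. The sketch below tries this on a discretised parameter space, which is an assumption made purely for illustration, for $k$ Bernoulli trials summarised by their sufficient count of successes. Since $f_k(\theta) = p(\theta) \exp\{\int p(z_k \mid \theta) \log [p(z_k \mid \theta)/p(z_k)]\, dz_k\}$, this update coincides with what is usually called the Blahut-Arimoto iteration in information theory; it is not the route taken in the text, which proceeds through asymptotic approximations. The grid, the value of $k$ and all function names are hypothetical.

```python
import math


def binomial_pmf(s, k, theta):
    """P(S = s) for S ~ Binomial(k, theta), with 0 < theta < 1."""
    log_p = (math.lgamma(k + 1) - math.lgamma(s + 1) - math.lgamma(k - s + 1)
             + s * math.log(theta) + (k - s) * math.log(1.0 - theta))
    return math.exp(log_p)


def marginal_of(prior, lik):
    """p(z) = sum_theta p(theta) p(z | theta) for each outcome z."""
    return [sum(p * row[z] for p, row in zip(prior, lik))
            for z in range(len(lik[0]))]


def expected_information(prior, lik):
    """I{e(k), p} = sum_theta p(theta) sum_z p(z|theta) log[p(z|theta)/p(z)]."""
    marginal = marginal_of(prior, lik)
    return sum(
        p * row[z] * (math.log(row[z]) - math.log(marginal[z]))
        for p, row in zip(prior, lik) if p > 0.0
        for z in range(len(row)) if row[z] > 0.0
    )


def update_prior(prior, lik):
    """One step p(theta) <- f_k(theta) / sum f_k, with f_k computed from p."""
    marginal = marginal_of(prior, lik)
    f = []
    for p, row in zip(prior, lik):
        if p == 0.0:
            f.append(0.0)
            continue
        # log f_k(theta) = sum_z p(z | theta) log p(theta | z)
        log_f = sum(
            row[z] * (math.log(p) + math.log(row[z]) - math.log(marginal[z]))
            for z in range(len(row)) if row[z] > 0.0
        )
        f.append(math.exp(log_f))
    total = sum(f)
    return [v / total for v in f]


def maximise_expected_information(thetas, k, n_iter=200):
    """Iterate the update from a uniform start; returns (prior, likelihood)."""
    lik = [[binomial_pmf(s, k, th) for s in range(k + 1)] for th in thetas]
    prior = [1.0 / len(thetas)] * len(thetas)
    for _ in range(n_iter):
        prior = update_prior(prior, lik)
    return prior, lik


if __name__ == "__main__":
    grid = [i / 20 for i in range(1, 20)]
    prior, lik = maximise_expected_information(grid, k=5)
    print([round(p, 4) for p in prior])
    print(expected_information(prior, lik))
```

For small $k$ the maximising prior on such a grid tends to concentrate on a few points, which is one way of seeing why the text works with the limiting behaviour as $k \to \infty$ rather than with any fixed $k$.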
Definition 5.7. (One-dimensional reference distributions).

Let $x$ be the result of an experiment $e$ which consists of one observation from
$p(x \mid \theta)$, $x \in X$, $\theta \in \Theta \subseteq \Re$, let $z_k = \{x_1, \ldots, x_k\}$ be the result of $k$
independent replications of $e$, and define

\[ f_k^*(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p^*(\theta \mid z_k)\, dz_k \right\}, \]

where

\[ p^*(\theta \mid z_k) = \frac{\prod_{i=1}^{k} p(x_i \mid \theta)}{\int \prod_{i=1}^{k} p(x_i \mid \theta)\, d\theta}. \]

The reference posterior density of $\theta$ after $x$ has been observed is defined to be
the log-divergence limit, $\pi(\theta \mid x)$, of $\pi_k(\theta \mid x)$, assuming this limit to exist,
where

\[ \pi_k(\theta \mid x) = c_k(x)\, p(x \mid \theta)\, f_k^*(\theta), \]

the $c_k(x)$’s are the required normalising constants and, for almost all $x$,

\[ \lim_{k \to \infty} \int \pi_k(\theta \mid x) \log \frac{\pi_k(\theta \mid x)}{\pi(\theta \mid x)}\, d\theta = 0. \]

Any positive function $\pi(\theta)$ such that, for some $c(x) > 0$ and for all $\theta \in \Theta$,

\[ \pi(\theta \mid x) = c(x)\, p(x \mid \theta)\, \pi(\theta) \]

will be called a reference prior for $\theta$ relative to the experiment $e$.


It should be clear from the argument which motivates the definition that any
asymptotic approximation to the posterior distribution may be used in place of
the asymptotic approximation $p^*(\theta \mid z_k)$ defined above. The use of convergence
in the information sense, the natural convergence in this context, rather than just
pointwise convergence, is necessary to avoid possibly pathological behaviour; for
details, see Berger and Bernardo (1992c).

Although most of the following discussion refers to reference priors, it must be 
stressed that only reference posterior distributions are directly interpretable in
probabilistic terms. The positive functions $\pi(\theta)$ are merely pragmatically convenient
tools for the derivation of reference posterior distributions via Bayes’ theorem. An 
explicit form for the reference prior is immediately available from Definition 5.7, 
and it will be clear from later illustrative examples that the forms which arise may 
have no direct probabilistic interpretation. 

We should stress that the definitions and “propositions” in this section are by
and large heuristic in the sense that they are lacking statements of the technical
conditions which would make the theory rigorous. Making the statements and
proofs precise, however, would require a different level of mathematics from that
used in this book and, at the time of writing, is still an active area of research. The
reader interested in the technicalities involved is referred to Berger and Bernardo
(1989, 1992a, 1992b, 1992c) and Berger et al. (1989). So far as the contents of this
section are concerned, the reader would be best advised to view the procedure as an
“algorithm” which, compared with other proposals (discussed in Section 5.6.2),
appears to produce appealing solutions in all situations thus far examined.


Proposition 5.18. (Explicit form of the reference prior).

A reference prior for $\theta$ relative to the experiment which consists of one
observation from $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta \subseteq \Re$, is given, provided the limit exists,
and convergence in the information sense is verified, by

\[ \pi(\theta) = c \lim_{k \to \infty} \frac{f_k^*(\theta)}{f_k^*(\theta_0)}, \quad \theta \in \Theta, \]

where $c > 0$, $\theta_0 \in \Theta$,

\[ f_k^*(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p^*(\theta \mid z_k)\, dz_k \right\}, \]

with $z_k = \{x_1, \ldots, x_k\}$ a random sample from $p(x \mid \theta)$, and $p^*(\theta \mid z_k)$ is an
asymptotic approximation to the posterior distribution of $\theta$.


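Proof. Using $\pi(\theta)$ as a formal prior,

\[ \pi(\theta \mid x) \propto p(x \mid \theta)\, \pi(\theta) \propto p(x \mid \theta) \lim_{k \to \infty} \frac{f_k^*(\theta)}{f_k^*(\theta_0)} \propto \lim_{k \to \infty} \frac{p(x \mid \theta)\, f_k^*(\theta)}{\int p(x \mid \theta)\, f_k^*(\theta)\, d\theta}, \]

and hence

\[ \pi(\theta \mid x) = \lim_{k \to \infty} \pi_k(\theta \mid x), \qquad \pi_k(\theta \mid x) \propto p(x \mid \theta)\, f_k^*(\theta), \]

as required. Note that, under suitable regularity conditions, the limits above will
not depend on the particular asymptotic approximation to the posterior distribution
used to derive $f_k^*(\theta)$.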


If the parameter space is finite, it turns out that the reference prior is uniform, 
independently of the experiment performed. 


Proposition 5.19. (Reference prior in the finite case). Let $x$ be the result
of one observation from $p(x \mid \theta)$, where $\theta \in \Theta = \{\theta_1, \ldots, \theta_M\}$. Then, any
function of the form $\pi(\theta_i) = a$, $a > 0$, $i = 1, \ldots, M$, is a reference prior and
the reference posterior is

\[ \pi(\theta_i \mid x) = c(x)\, p(x \mid \theta_i), \quad i = 1, \ldots, M, \]

where $c(x)$ is the required normalising constant.
Proof. We have already established (Proposition 5.12) that if $\Theta$ is finite then,
for any strictly positive prior, $p(\theta_i \mid x_1, \ldots, x_k)$ will converge to 1 if $\theta_i$ is the true
value of $\theta$. It follows that the integral in the exponent of

\[ f_k(\theta_i) = \exp\left\{ \int p(z_k \mid \theta_i) \log p(\theta_i \mid z_k)\, dz_k \right\}, \quad i = 1, \ldots, M, \]

will converge to zero as $k \to \infty$. Hence, a reference prior is given by

\[ \pi(\theta_i) = \lim_{k \to \infty} f_k(\theta_i) = 1. \]

The general form of reference prior follows immediately.


The preceding result for the case of a finite parameter space is easily derived 
from first principles. Indeed, in this case the expected missing information is finite 
and equals the entropy

\[ H\{p(\theta)\} = -\sum_{i=1}^{M} p(\theta_i) \log p(\theta_i) \]


of the prior. This is maximised if and only if the prior is uniform. 
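The statement that, for finite $\Theta$, the missing information equals the entropy of the prior can be illustrated numerically. The following sketch rests on assumptions chosen only for illustration: a three-point parameter space, Bernoulli observations summarised by their sufficient count of successes, and hypothetical function names. It computes $I\{e(k), p(\theta)\}$ for increasing $k$ and compares it with $H\{p(\theta)\}$.

```python
import math


def log_binomial_pmf(s, k, theta):
    """log P(S = s) for S ~ Binomial(k, theta), with 0 < theta < 1."""
    return (math.lgamma(k + 1) - math.lgamma(s + 1) - math.lgamma(k - s + 1)
            + s * math.log(theta) + (k - s) * math.log(1.0 - theta))


def entropy(prior):
    """H{p} = -sum_i p_i log p_i."""
    return -sum(p * math.log(p) for p in prior if p > 0.0)


def expected_information(prior, thetas, k):
    """I{e(k), p} for k Bernoulli trials, via the count of successes."""
    total = 0.0
    for s in range(k + 1):
        joint = [p * math.exp(log_binomial_pmf(s, k, th))
                 for p, th in zip(prior, thetas)]
        marginal = sum(joint)
        for p, j in zip(prior, joint):
            if j > 0.0:
                # contribution j * log[p(theta | s) / p(theta)]
                total += j * math.log(j / (marginal * p))
    return total


if __name__ == "__main__":
    thetas = [0.2, 0.5, 0.8]
    prior = [0.5, 0.3, 0.2]
    for k in (1, 10, 100, 400):
        print(k, expected_information(prior, thetas, k))
    print("entropy of prior:  ", entropy(prior))
    print("entropy of uniform:", entropy([1 / 3] * 3))
```

The expected information increases with $k$ towards the entropy of the prior, and that entropy is largest for the uniform prior, in agreement with Proposition 5.19.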


The technique encapsulated in Definition 5.7 for identifying the reference prior 
depends on the asymptotic behaviour of the posterior for the parameter of interest 
under (imagined) replications of the experiment to be actually performed. Thus far,
our derivations have proceeded on the basis of an assumed single observation from
a parametric model, $p(x \mid \theta)$. The next proposition establishes that for experiments
involving a sequence of $n \ge 1$ observations, which are to be modelled as if they
are a random sample, conditional on a parametric model, the reference prior does 
not depend on the size of the experiment and can thus be derived on the basis of 
a single observation experiment. Note, however, that for experiments involving 
more structured designs (for example, in linear models) the situation is much more 
complicated. 


Proposition 5.20. (Independence of sample size).

Let $e_n$, $n \ge 1$, be the experiment which consists of the observation of a random
sample $x_1, \ldots, x_n$ from $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$, and let $P_n$ denote the
class of reference priors for $\theta$ with respect to $e_n$, derived in accordance with
Definition 5.7, by considering the sample to be a single observation from
$\prod_{i=1}^{n} p(x_i \mid \theta)$. Then $P_1 = P_n$, for all $n$.
Proof. If $z_k = \{x_1, \ldots, x_k\}$ is the result of a $k$-fold independent replicate of
$e_1$, then, by Proposition 5.18, $P_1$ consists of $\pi(\theta)$ of the form

\[ \pi(\theta) = c \lim_{k \to \infty} \frac{f_k^*(\theta)}{f_k^*(\theta_0)} \]

with $c > 0$, $\theta, \theta_0 \in \Theta$ and

\[ f_k^*(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p^*(\theta \mid z_k)\, dz_k \right\}, \]

where $p^*(\theta \mid z_k)$ is an asymptotic approximation (as $k \to \infty$) to the posterior
distribution of $\theta$ given $z_k$.

Now consider $z_{nk} = \{x_1, \ldots, x_n, x_{n+1}, \ldots, x_{2n}, \ldots, x_{kn}\}$ which can be
considered as the result of a $k$-fold independent replicate of $e_n$, so that $P_n$ consists
of $\pi(\theta)$ of the form

\[ \pi(\theta) = c \lim_{k \to \infty} \frac{f_{nk}^*(\theta)}{f_{nk}^*(\theta_0)}. \]

But $z_{nk}$ can equally be considered as an $nk$-fold independent replicate of $e_1$ and so
the limiting ratios are clearly identical.


In considering experiments involving random samples from distributions ad- 
mitting a sufficient statistic of fixed dimension, it is natural to wonder whether the 
reference priors derived from the distribution of the sufficient statistic are identical 
to those derived from the joint distribution for the sample. The next proposition 
guarantees us that this is indeed the case. 


Proposition 5.21. (Compatibility with sufficient statistics).

Let $e_n$, $n \ge 1$, be the experiment which consists of the observation of a random
sample $x_1, \ldots, x_n$ from $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$, where, for all $n$, the latter
admits a sufficient statistic $t_n = t(x_1, \ldots, x_n)$. Then, for any $n$, the classes
of reference priors derived by considering replications of $(x_1, \ldots, x_n)$ and $t_n$,
respectively, coincide, and are identical to the class obtained by considering
replications of $e_1$.


Proof. If $z_k$ denotes a $k$-fold replicate of $(x_1, \ldots, x_n)$ and $y_k$ denotes the
corresponding $k$-fold replicate of $t_n$, then, by the definition of a sufficient
statistic, $p(\theta \mid z_k) = p(\theta \mid y_k)$, for any prior $p(\theta)$. It follows that the corresponding
asymptotic distributions are identical, so that $p^*(\theta \mid z_k) = p^*(\theta \mid y_k)$. We thus have

\[ f_k^*(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p^*(\theta \mid z_k)\, dz_k \right\}
 = \exp\left\{ \int p(z_k \mid \theta) \log p^*(\theta \mid y_k)\, dz_k \right\}
 = \exp\left\{ \int p(y_k \mid \theta) \log p^*(\theta \mid y_k)\, dy_k \right\}, \]

so that, by Definition 5.7, the reference priors are identical. Identity with those
derived from $e_1$ follows from Proposition 5.20.


Given a parametric model, $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$, we could, of course,
reparametrise and work instead with $p(x \mid \phi)$, $x \in X$, $\phi = \phi(\theta)$, for any monotone
one-to-one mapping $g : \Theta \to \Phi$. The question now arises as to whether
reference priors for $\theta$ and $\phi$, derived from the parametric models $p(x \mid \theta)$ and $p(x \mid \phi)$,
respectively, are consistent, in the sense that their ratio is the required Jacobian
element. The next proposition establishes this form of consistency and can clearly
be extended to mappings which are piecewise monotone.


Proposition 5.22. (Invariance under one-to-one transformations).
Suppose that $\pi_\theta(\theta)$, $\pi_\phi(\phi)$ are reference priors derived by considering
replications of experiments consisting of a single observation from $p(x \mid \theta)$, with
$x \in X$, $\theta \in \Theta$ and from $p(x \mid \phi)$, with $x \in X$, $\phi \in \Phi$, respectively, where
$\phi = g(\theta)$ and $g : \Theta \to \Phi$ is a one-to-one monotone mapping. Then, for some
$c > 0$ and for all $\phi \in \Phi$:

(i) $\pi_\phi(\phi) = c\, \pi_\theta(g^{-1}(\phi))$, if $\Theta$ is discrete;

(ii) $\pi_\phi(\phi) = c\, \pi_\theta(g^{-1}(\phi))\, |J_\phi|$, if $J_\phi = \dfrac{\partial g^{-1}(\phi)}{\partial \phi}$ exists.


Proof. If $\Theta$ is discrete, so is $\Phi$ and the result follows from Proposition 5.19.
Otherwise, if $z_k$ denotes a $k$-fold replicate of a single observation from $p(x \mid \theta)$,
then, for any proper prior $p(\theta)$, the corresponding prior for $\phi$ is given by
$p_\phi(\phi) = p_\theta(g^{-1}(\phi))\, |J_\phi|$ and hence, for all $\phi \in \Phi$,

\[ p_\phi(\phi \mid z_k) = p_\theta(g^{-1}(\phi) \mid z_k)\, |J_\phi|. \]

It follows that, as $k \to \infty$, the asymptotic posterior approximations are related by
the same Jacobian element and hence

\[ f_k^*(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p^*(\theta \mid z_k)\, dz_k \right\}
 = |J_\phi|^{-1} \exp\left\{ \int p(z_k \mid \phi) \log p^*(\phi \mid z_k)\, dz_k \right\}
 = |J_\phi|^{-1} f_k^*(\phi). \]

The second result now follows from Proposition 5.18.
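Part (ii) of Proposition 5.22 is the usual change-of-variable rule, and it may help to see it applied mechanically. The sketch below is illustrative only: the prior $\pi_\theta(\theta) \propto 1/\theta$ on $\theta > 0$ and the mapping $\phi = \log\theta$ are assumptions chosen for the demonstration, the Jacobian is approximated by a central difference, and the function name is hypothetical.

```python
import math


def transformed_prior(prior_theta, g_inverse, phi, step=1e-6):
    """pi_phi(phi) = pi_theta(g^{-1}(phi)) * |J_phi|.

    J_phi = d g^{-1}(phi) / d phi is approximated by a central difference.
    """
    jacobian = (g_inverse(phi + step) - g_inverse(phi - step)) / (2.0 * step)
    return prior_theta(g_inverse(phi)) * abs(jacobian)


if __name__ == "__main__":
    def prior_theta(theta):
        return 1.0 / theta  # illustrative prior on theta > 0

    # phi = log(theta), so that g^{-1}(phi) = exp(phi).
    for phi in (-2.0, 0.0, 3.0):
        print(phi, transformed_prior(prior_theta, math.exp, phi))
```

Under these assumptions the transformed function is constant in $\phi$, so a prior proportional to $1/\theta$ for a positive parameter corresponds to a uniform prior for its logarithm.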


The assumed existence of the asymptotic posterior distributions that would 
result from an imagined k-fold replicate of the experiment under consideration 
clearly plays a key role in the derivation of the reference prior. However, it is
important to note that no assumption has thus far been required concerning the 
form of this asymptotic posterior distribution. As we shall see later, we shall 
typically consider the case of asymptotic posterior normality, but the following 
example shows that the technique is by no means restricted to this case. 
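Example 5.12. (Uniform model). Let $e$ be the experiment which consists of observing
the sequence $x_1, \ldots, x_n$, $n \ge 1$, whose belief distribution is represented as that of a random
sample from a uniform distribution on $[\theta - \tfrac{1}{2}, \theta + \tfrac{1}{2}]$, $\theta \in \Re$, together with a prior distribution
$p(\theta)$ for $\theta$. If

\[ t_n = \left[ x_{\min}^{(n)}, x_{\max}^{(n)} \right], \qquad x_{\min}^{(n)} = \min\{x_1, \ldots, x_n\}, \qquad x_{\max}^{(n)} = \max\{x_1, \ldots, x_n\}, \]

then $t_n$ is a sufficient statistic for $\theta$, and

\[ p(\theta \mid \boldsymbol{x}) = p(\theta \mid t_n) \propto p(\theta), \qquad x_{\max}^{(n)} - \tfrac{1}{2} \le \theta \le x_{\min}^{(n)} + \tfrac{1}{2}. \]

It follows that, as $k \to \infty$, a $k$-fold replicate of $e$ with a uniform prior will result in the
posterior uniform distribution

\[ p^*(\theta \mid t_{kn}) \propto c, \qquad x_{\max}^{(kn)} - \tfrac{1}{2} \le \theta \le x_{\min}^{(kn)} + \tfrac{1}{2}. \]

It is easily verified that

\[ \int p(t_{kn} \mid \theta) \log p^*(\theta \mid t_{kn})\, dt_{kn} = E\left[ -\log\left\{ 1 - \left( x_{\max}^{(kn)} - x_{\min}^{(kn)} \right) \right\} \,\Big|\, \theta \right], \]

the expectation being with respect to the distribution of $t_{kn}$. For large $k$, the right-hand side
is well-approximated by

\[ -\log\left\{ 1 - \left( E\left[ x_{\max}^{(kn)} \right] - E\left[ x_{\min}^{(kn)} \right] \right) \right\} \]

and, noting that the distributions of

\[ u = x_{\max}^{(kn)} - \theta + \tfrac{1}{2}, \qquad v = x_{\min}^{(kn)} - \theta + \tfrac{1}{2} \]

are $\mathrm{Be}(u \mid kn, 1)$ and $\mathrm{Be}(v \mid 1, kn)$, respectively, we see that the above reduces to

\[ -\log\left\{ 1 - \left( \frac{kn}{kn + 1} - \frac{1}{kn + 1} \right) \right\} = \log \frac{kn + 1}{2}. \]

It follows that $f_{kn}^*(\theta) = (kn + 1)/2$, and hence that

\[ \pi(\theta) = c \lim_{k \to \infty} \frac{(kn + 1)/2}{(kn + 1)/2} = c. \]

Any reference prior for this problem is therefore a constant and, therefore, given a set of
actual data $\boldsymbol{x} = (x_1, \ldots, x_n)$, the reference posterior distribution is

\[ \pi(\theta \mid \boldsymbol{x}) \propto c, \qquad x_{\max}^{(n)} - \tfrac{1}{2} \le \theta \le x_{\min}^{(n)} + \tfrac{1}{2}, \]

a uniform distribution over the set of $\theta$ values which remain possible after $\boldsymbol{x}$ has been
observed.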
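The conclusion of Example 5.12 translates directly into a computation on observed data. The following minimal sketch, whose function names and data values are hypothetical, returns the end points of the uniform reference posterior and evaluates its density.

```python
def uniform_reference_posterior(xs):
    """End points (lower, upper) of the uniform reference posterior for theta."""
    lower = max(xs) - 0.5
    upper = min(xs) + 0.5
    if lower > upper:
        raise ValueError("data are incompatible with a unit-width uniform model")
    return lower, upper


def reference_posterior_density(theta, xs):
    """Density of the reference posterior at theta."""
    lower, upper = uniform_reference_posterior(xs)
    width = upper - lower
    if width == 0.0:
        raise ValueError("posterior is degenerate at a single point")
    return 1.0 / width if lower <= theta <= upper else 0.0


if __name__ == "__main__":
    data = [0.1, 0.4, 0.3]  # illustrative observations
    print(uniform_reference_posterior(data))
    print(reference_posterior_density(0.2, data))
```

The width of the posterior support is one minus the sample range, so the posterior sharpens only as the observed extremes move apart.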
Typically, under suitable regularity conditions, the asymptotic posterior
distribution $p^*(\theta \mid z_{kn})$, corresponding to an imagined $k$-fold replication of an
experiment $e_n$ involving a random sample of $n$ from $p(x \mid \theta)$, will only depend on $z_{kn}$
through an asymptotically sufficient, consistent estimate of $\theta$, a concept which is
made precise in the next proposition. In such cases, the reference prior can easily
be identified from the form of the asymptotic posterior distribution.


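Proposition 5.23. (Explicit form of the reference prior when there is a
consistent, asymptotically sufficient, estimator). Let $e_n$ be the experiment
which consists of the observation of a random sample $\boldsymbol{x} = \{x_1, \ldots, x_n\}$ from
$p(x \mid \theta)$, $x \in X$, $\theta \in \Theta \subseteq \Re$, and let $z_{kn}$ be the result of a $k$-fold replicate of
$e_n$. If there exists $\hat{\theta}_{kn} = \hat{\theta}_{kn}(z_{kn})$ such that, with probability one,

\[ \lim_{k \to \infty} \hat{\theta}_{kn} = \theta \]

and, as $k \to \infty$,

\[ \int p^*(\theta \mid z_{kn}) \log \frac{p^*(\theta \mid z_{kn})}{p^*(\theta \mid \hat{\theta}_{kn})}\, d\theta \to 0, \]

then, for any $c > 0$, $\theta_0 \in \Theta$, reference priors are defined by

\[ \pi(\theta) = c \lim_{k \to \infty} \frac{f_{kn}^*(\theta)}{f_{kn}^*(\theta_0)}, \]

where

\[ f_{kn}^*(\theta) = p^*(\theta \mid \hat{\theta}_{kn}) \Big|_{\hat{\theta}_{kn} = \theta}. \]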

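Proof. As $k \to \infty$, it follows from the assumptions that

\[ f_{kn}^*(\theta) = \exp\left\{ \int p(z_{kn} \mid \theta) \log p^*(\theta \mid z_{kn})\, dz_{kn} \right\}
 = \exp\left\{ \int p(z_{kn} \mid \theta) \log p^*(\theta \mid \hat{\theta}_{kn})\, dz_{kn} \right\} \]
\[ = \exp\left\{ \int p(\hat{\theta}_{kn} \mid \theta) \log p^*(\theta \mid \hat{\theta}_{kn})\, d\hat{\theta}_{kn} \right\}
 = \exp\left\{ \log p^*(\theta \mid \hat{\theta}_{kn}) \Big|_{\hat{\theta}_{kn} = \theta} \right\}
 = p^*(\theta \mid \hat{\theta}_{kn}) \Big|_{\hat{\theta}_{kn} = \theta}. \]

The result now follows from Proposition 5.18.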
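Example 5.13. (Deviation from uniformity model). Let $e_n$ be the experiment which
consists of obtaining a random sample from $p(x \mid \theta)$, $0 \le x \le 1$, $\theta > 0$, where

\[ p(x \mid \theta) = \begin{cases} \theta\{2x\}^{\theta - 1} & \text{for } 0 \le x \le \tfrac{1}{2} \\ \theta\{2(1 - x)\}^{\theta - 1} & \text{for } \tfrac{1}{2} \le x \le 1 \end{cases} \]

defines a one-parameter probability model on $[0, 1]$, which finds application (see Bernardo
and Bayarri, 1985) in exploring deviations from the standard uniform model on $[0, 1]$ (given
by $\theta = 1$).

It is easily verified that if $z_{kn} = \{x_1, \ldots, x_{kn}\}$ results from a $k$-fold replicate of $e_n$,
the sufficient statistic $t_{kn}$ is given by

\[ t_{kn} = -\frac{1}{kn} \sum_{i=1}^{kn} \left\{ \log\{2x_i\}\, 1_{[0, 1/2]}(x_i) + \log\{2(1 - x_i)\}\, 1_{(1/2, 1]}(x_i) \right\} \]

and, for any prior $p(\theta)$,

\[ p(\theta \mid z_{kn}) = p(\theta \mid t_{kn}) \propto p(\theta)\, \theta^{kn} \exp\{-kn(\theta - 1)t_{kn}\}. \]

It is also easily shown that $p(t_{kn} \mid \theta) = \mathrm{Ga}(t_{kn} \mid kn, kn\theta)$, so that

\[ E[t_{kn} \mid \theta] = \frac{1}{\theta}, \qquad V[t_{kn} \mid \theta] = \frac{1}{kn\theta^2}, \]

from which we can establish that $\hat{\theta}_{kn} = t_{kn}^{-1}$ is a sufficient, consistent estimate of $\theta$. It follows
that

\[ p^*(\theta \mid \hat{\theta}_{kn}) \propto \theta^{kn} \exp\left\{ -\frac{kn\theta}{\hat{\theta}_{kn}} \right\} \]

provides, for large $k$, an asymptotic posterior approximation which satisfies the conditions
required in Proposition 5.23. From the form of the right-hand side, we see that

\[ p^*(\theta \mid \hat{\theta}_{kn}) = \mathrm{Ga}(\theta \mid kn + 1, kn/\hat{\theta}_{kn})
 = \frac{(kn/\hat{\theta}_{kn})^{kn + 1}}{\Gamma(kn + 1)}\, \theta^{kn} \exp\left\{ -\frac{kn\theta}{\hat{\theta}_{kn}} \right\}, \]

so that

\[ f_{kn}^*(\theta) = p^*(\theta \mid \hat{\theta}_{kn}) \Big|_{\hat{\theta}_{kn} = \theta} = \frac{(kn)^{kn + 1} e^{-kn}}{\Gamma(kn + 1)\, \theta} \]

and, from Proposition 5.18, for some $c > 0$, $\theta_0 > 0$,

\[ \pi(\theta) = c \lim_{k \to \infty} \frac{f_{kn}^*(\theta)}{f_{kn}^*(\theta_0)} = \frac{c\,\theta_0}{\theta} \propto \frac{1}{\theta}. \]

The reference posterior for $\theta$ having observed actual data $\boldsymbol{x} = (x_1, \ldots, x_n)$, producing the
sufficient statistic $t = t(\boldsymbol{x})$, is therefore

\[ \pi(\theta \mid \boldsymbol{x}) = \pi(\theta \mid t) \propto p(\boldsymbol{x} \mid \theta)\, \frac{1}{\theta} \propto \theta^{n - 1} \exp\{-n(\theta - 1)t\}, \]

which is a $\mathrm{Ga}(\theta \mid n, nt)$ distribution.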
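The steps of Example 5.13 can be followed numerically. The sketch below is an illustration with hypothetical function names and data: it computes the sufficient statistic $t$, the parameters of the $\mathrm{Ga}(\theta \mid n, nt)$ reference posterior, and the logarithm of $f_{kn}^*(\theta)$, so that the ratio $f_{kn}^*(\theta)/f_{kn}^*(\theta_0)$ can be seen to equal $\theta_0/\theta$ whatever the value of $kn$. Observations are assumed to lie strictly inside $(0, 1)$, since the logarithms are undefined at the end points.

```python
import math


def sufficient_statistic(xs):
    """t = -(1/n) * sum of log{2x} (x <= 1/2) or log{2(1 - x)} (x > 1/2)."""
    total = 0.0
    for x in xs:
        total += math.log(2.0 * x) if x <= 0.5 else math.log(2.0 * (1.0 - x))
    return -total / len(xs)


def reference_posterior_parameters(xs):
    """Shape and rate of the Ga(theta | n, n t) reference posterior."""
    n = len(xs)
    return n, n * sufficient_statistic(xs)


def log_f_star(theta, kn):
    """log f*_{kn}(theta) = (kn+1) log(kn) - kn - log Gamma(kn+1) - log(theta)."""
    return (kn + 1) * math.log(kn) - kn - math.lgamma(kn + 1) - math.log(theta)


if __name__ == "__main__":
    data = [0.25, 0.75, 0.25, 0.75]  # illustrative observations
    shape, rate = reference_posterior_parameters(data)
    print(shape, rate, "posterior mean:", shape / rate)
    for kn in (10, 100, 1000):
        print(kn, math.exp(log_f_star(2.0, kn) - log_f_star(0.5, kn)))
```

The posterior mean, shape divided by rate, is simply $1/t$, the same value as the consistent estimate $t^{-1}$ used in the example.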
Under regularity conditions similar to those described in Section 5.2.3, the
asymptotic posterior distribution of $\theta$ tends to normality. In such cases, we can
obtain a characterisation of the reference prior directly in terms of the parametric
model in which $\theta$ appears.


Proposition 5.24. (Reference priors under asymptotic normality).

Let $e_n$ be the experiment which consists of the observation of a random sample
$x_1, \ldots, x_n$ from $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta \subseteq \Re$. Then, if the asymptotic
posterior distribution of $\theta$, given a $k$-fold replicate of $e_n$, is normal with
precision $kn\,h(\hat{\theta}_{kn})$, where $\hat{\theta}_{kn}$ is a consistent estimate of $\theta$, reference priors
have the form

\[ \pi(\theta) \propto \{h(\theta)\}^{1/2}. \]


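Proof. Under regularity conditions such as those detailed in Section 5.2.3, it
follows that an asymptotic approximation to the posterior distribution of $\theta$, given a
$k$-fold replicate of $e_n$, is

\[ p^*(\theta \mid \hat{\theta}_{kn}) = N\left(\theta \mid \hat{\theta}_{kn}, kn\,h(\hat{\theta}_{kn})\right), \]

where $\hat{\theta}_{kn}$ is some consistent estimator of $\theta$. Thus, by Proposition 5.23,

\[ f_{kn}^*(\theta) = p^*(\theta \mid \hat{\theta}_{kn}) \Big|_{\hat{\theta}_{kn} = \theta} = (2\pi)^{-1/2} (kn)^{1/2} \{h(\theta)\}^{1/2} \]

and therefore, for some $c > 0$, $\theta_0 \in \Theta$,

\[ \pi(\theta) = c \lim_{k \to \infty} \frac{f_{kn}^*(\theta)}{f_{kn}^*(\theta_0)} = c\, \frac{\{h(\theta)\}^{1/2}}{\{h(\theta_0)\}^{1/2}} \propto \{h(\theta)\}^{1/2}, \]

as required.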


The result of Proposition 5.24 is closely related to the “rules” proposed by
Jeffreys (1946, 1939/1961) and by Perks (1947) to derive “non-informative” priors.
Typically, under the conditions where asymptotic posterior normality obtains we
find that

\[ h(\theta) = \int p(x \mid \theta) \left( -\frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta) \right) dx, \]

i.e., Fisher’s information (Fisher, 1925), and hence the reference prior,

\[ \pi(\theta) \propto h(\theta)^{1/2}, \]

becomes Jeffreys’ (or Perks’) prior. See Polson (1992) for a related derivation.

It should be noted however that, even under conditions which guarantee asymp- 
totic normality, Jeffreys’ formula is not necessarily the easiest way of deriving a 
reference prior. As illustrated in Examples 5.12 and 5.13 above, it is often sim- 
pler to apply Proposition 5.18 using an asymptotic approximation to the posterior 
distribution. 

It is important to stress that reference distributions are, by definition, a function
of the entire probability model $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$, not only of the observed
likelihood. Technically, this is a consequence of the fact that the amount of infor- 
mation which an experiment may be expected to provide is the value of an integral 
over the entire sample space X, which, therefore, has to be specified. We have, of 
course, already encountered in Section 5.1.4 the idea that knowledge of the data 
generating mechanism may influence the prior specification. 


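Example 5.14. (Binomial and negative binomial models). Consider an experiment which consists of the observation of $n$ Bernoulli trials, with $n$ fixed in advance, so that $x = \{x_1, \ldots, x_n\}$,

\[ p(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}, \quad x \in \{0, 1\}, \quad 0 \le \theta \le 1, \]

\[ h(\theta) = -\sum_{x=0}^{1} p(x \mid \theta)\, \frac{\partial^2}{\partial\theta^2} \log p(x \mid \theta) = \theta^{-1}(1-\theta)^{-1}, \]

and hence, by Proposition 5.24, the reference prior is

\[ \pi(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}. \]

If $r = \sum_{i=1}^{n} x_i$, the reference posterior,

\[ \pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta) \propto \theta^{r-1/2}(1-\theta)^{n-r-1/2}, \]

is the beta distribution $\mathrm{Be}(\theta \mid r + \tfrac12, n - r + \tfrac12)$. Note that $\pi(\theta \mid x)$ is proper, whatever the number of successes $r$. In particular, if $r = 0$, $\pi(\theta \mid x) = \mathrm{Be}(\theta \mid \tfrac12, n + \tfrac12)$, from which sensible inference summaries can be made, even though there are no observed successes. (Compare this with the Haldane (1948) prior, $\pi(\theta) \propto \theta^{-1}(1-\theta)^{-1}$, which produces an improper posterior until at least one success is observed.)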

Consider now, however, an experiment which consists of counting the number $x$ of Bernoulli trials which it is necessary to perform in order to observe a prespecified number of successes, $r \ge 1$. The probability model for this situation is the negative binomial

\[ p(x \mid \theta) = \binom{x-1}{r-1}\, \theta^{r}(1-\theta)^{x-r}, \quad x = r, r+1, \ldots \]

from which we obtain

\[ h(\theta) = -\sum_{x=r}^{\infty} p(x \mid \theta)\, \frac{\partial^2}{\partial\theta^2} \log p(x \mid \theta) = r\,\theta^{-2}(1-\theta)^{-1} \]

and hence, by Proposition 5.24, the reference prior is $\pi(\theta) \propto \theta^{-1}(1-\theta)^{-1/2}$. The reference posterior is given by

\[ \pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta) \propto \theta^{r-1}(1-\theta)^{x-r-1/2}, \quad x = r, r+1, \ldots \]

which is the beta distribution $\mathrm{Be}(\theta \mid r, x - r + \tfrac12)$. Again, we note that this distribution is proper, whatever the number of observations $x$ required to obtain $r$ successes. Note that $r = 0$ is not possible under this model: the use of an inverse binomial sampling design implicitly assumes that $r$ successes will eventually occur for sure, which is not true in direct binomial sampling. This difference in the underlying assumption about $\theta$ is duly reflected in the slight difference which occurs between the respective reference prior distributions.

See Geisser (1984) and ensuing discussion for further analysis and discussion of this 
canonical example. 
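The two information functions in Example 5.14 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names are illustrative, and the infinite sum for the negative binomial model is truncated at an assumed number of terms, which is adequate only when $\theta$ is not too close to zero.

```python
def fisher_information_bernoulli(theta):
    """h(theta) for one Bernoulli trial: expectation over x in {0, 1}."""
    total = 0.0
    for x in (0, 1):
        prob = theta**x * (1.0 - theta) ** (1 - x)
        # -(d^2/dtheta^2) log p(x | theta) = x / theta^2 + (1 - x) / (1 - theta)^2
        total += prob * (x / theta**2 + (1 - x) / (1.0 - theta) ** 2)
    return total


def fisher_information_negbin(theta, r, terms=5000):
    """h(theta) for the negative binomial, with the sum over x truncated (assumption)."""
    total = 0.0
    prob = theta**r  # p(x = r | theta)
    for x in range(r, r + terms):
        # -(d^2/dtheta^2) log p(x | theta) = r / theta^2 + (x - r) / (1 - theta)^2
        total += prob * (r / theta**2 + (x - r) / (1.0 - theta) ** 2)
        prob *= x / (x - r + 1) * (1.0 - theta)  # p(x + 1 | theta) from p(x | theta)
    return total


def reference_posterior_binomial(r, n):
    """Parameters (a, b) of Be(theta | a, b) under direct binomial sampling."""
    return (r + 0.5, n - r + 0.5)


def reference_posterior_negbin(r, x):
    """Parameters (a, b) of Be(theta | a, b) under inverse binomial sampling."""
    return (r, x - r + 0.5)


theta = 0.3  # illustrative value
print(fisher_information_bernoulli(theta), 1.0 / (theta * (1.0 - theta)))
print(fisher_information_negbin(theta, r=3), 3.0 / (theta**2 * (1.0 - theta)))
```

With the same data, say 3 successes in 10 trials, the two designs give the posteriors Be(3.5, 7.5) and Be(3, 7.5), which makes the dependence on the sampling mechanism concrete.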


In reporting results, scientists are typically required to specify not only the
data but also the conditions under which the data were obtained (the design of
the experiment), so that the data analyst has available the full specification of the
probability model $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$. In order to carry out the reference
analysis described in this section, such a full specification is clearly required.

We want to stress, however, that the preceding argument is totally compatible
with a full personalistic view of probability. A reference prior is nothing but 
a (limiting) form of rather specific beliefs: namely, those which maximise the 
missing information which a particular experiment could possibly be expected to 
provide. Consequently, different experiments generally define different types of 
limiting beliefs. To report the corresponding reference posteriors (possibly for a 
range of possible alternative models) is only part of the general prior-to-posterior 
mapping which interpersonal or sensitivity considerations would suggest should 
always be carried out. Reference analysis provides an answer to an important 
“what if?” question: namely, what can be said about the parameter of interest 
if prior information were minimal relative to the maximum information which a 
well-defined, specific experiment could be expected to provide? 


5.4.3 Restricted Reference Distributions 


When analysing the inferential implications of the result of an experiment for a quantity of interest, $\theta$, where, for simplicity, we continue to assume that $\theta \in \Theta \subseteq \Re$, it is often interesting, either per se, or on a "what if?" basis, to condition on some assumed features of the prior distribution $p(\theta)$, thus defining a restricted class, $Q$, say, of priors which consists of those distributions compatible with such conditioning. The concept of a reference posterior may easily be extended to this situation by maximising the missing information which the experiment may possibly be expected to provide within this restricted class of priors.

Repeating the argument which motivated the definition of (unrestricted) reference distributions, we are led to seek the limit of the sequence of posterior distributions, $\pi_k(\theta \mid x)$, which correspond to the sequence of priors, $\pi_k(\theta)$, which are obtained by maximising, within $Q$, the amount of information

\[ I\{e(k), p(\theta)\} = \int p(\theta) \log \frac{f_k(\theta)}{p(\theta)}\, d\theta, \]

where

\[ f_k(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p(\theta \mid z_k)\, dz_k \right\}, \]

which could be expected from $k$ independent replications $z_k = \{x_1, \ldots, x_k\}$ of the single observation experiment.


Definition 5.8. (Restricted reference distributions).

Let $x$ be the result of an experiment $e$ which consists of one observation from $p(x \mid \theta)$, $x \in X$, with $\theta \in \Theta \subseteq \Re$, let $Q$ be a subclass of the class of all prior distributions for $\theta$, let $z_k = \{x_1, \ldots, x_k\}$ be the result of $k$ independent replications of $e$ and define

\[ f^*_k(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p^*(\theta \mid z_k)\, dz_k \right\}, \]

where

\[ p^*(\theta \mid z_k) = \frac{\prod_{i=1}^{k} p(x_i \mid \theta)}{\int \prod_{i=1}^{k} p(x_i \mid \theta)\, d\theta}. \]

Provided it exists, the $Q$-reference posterior distribution of $\theta$, after $x$ has been observed, is defined to be $\pi^Q(\theta \mid x)$, such that

\[ E[\delta\{\pi^Q_k(\theta \mid x), \pi^Q(\theta \mid x)\}] \to 0, \quad \text{as } k \to \infty, \]

\[ \pi^Q_k(\theta \mid x) \propto p(x \mid \theta)\,\pi^Q_k(\theta), \]

where $\delta$ is the logarithmic divergence specified in Definition 5.7, and $\pi^Q_k(\theta)$ is a prior which minimises, within $Q$,

\[ \int p(\theta) \log \frac{p(\theta)}{f^*_k(\theta)}\, d\theta. \]

A positive function $\pi^Q(\theta)$ in $Q$ such that

\[ \pi^Q(\theta \mid x) \propto p(x \mid \theta)\,\pi^Q(\theta), \quad \text{for all } \theta \in \Theta, \]

is then called a $Q$-reference prior for $\theta$ relative to the experiment $e$.

The intuitive content of Definition 5.8 is illuminated by the following result, which essentially establishes that the $Q$-reference prior is the closest prior in $Q$ to the unrestricted reference prior $\pi(\theta)$, in the sense of minimising its logarithmic divergence from $\pi(\theta)$.


Proposition 5.25. (The restricted reference prior as an approximation).
Suppose that an unrestricted reference prior $\pi(\theta)$ relative to a given experiment is proper; then, if it exists, a $Q$-reference prior $\pi^Q(\theta)$ satisfies

\[ \int \pi^Q(\theta) \log \frac{\pi^Q(\theta)}{\pi(\theta)}\, d\theta = \inf_{p \in Q} \int p(\theta) \log \frac{p(\theta)}{\pi(\theta)}\, d\theta. \]

Proof. It follows from Proposition 5.18 that $\pi(\theta)$ is proper if and only if

\[ \int f^*_k(\theta)\, d\theta = c_k < \infty, \]

in which case,

\[ \pi(\theta) = \lim_{k\to\infty} \pi_k(\theta) = \lim_{k\to\infty} c_k^{-1} f^*_k(\theta). \]

Moreover,

\[ \int p(\theta) \log \frac{f^*_k(\theta)}{p(\theta)}\, d\theta = \int p(\theta) \log \frac{c_k\, c_k^{-1} f^*_k(\theta)}{p(\theta)}\, d\theta = \log c_k - \int p(\theta) \log \frac{p(\theta)}{\pi_k(\theta)}\, d\theta, \]

which is maximised if the integral is minimised. Let $\pi^Q_k(\theta)$ be the prior which minimises the integral within $Q$. Then, by Definition 5.8,

\[ \pi^Q(\theta \mid x) \propto p(x \mid \theta) \lim_{k\to\infty} \pi^Q_k(\theta) = p(x \mid \theta)\,\pi^Q(\theta), \]

where, by the continuity of the divergence functional, $\pi^Q(\theta)$ is the prior which minimises, within $Q$,

\[ \int p(\theta) \log \left\{ \frac{p(\theta)}{\lim_{k\to\infty} \pi_k(\theta)} \right\} d\theta = \int p(\theta) \log \left\{ \frac{p(\theta)}{\pi(\theta)} \right\} d\theta. \]

If $\pi(\theta)$ is not proper, it is necessary to apply Definition 5.8 directly in order to characterise $\pi^Q(\theta)$. The following result provides an explicit solution for the rather large class of problems where the conditions which define $Q$ may be expressed as a collection of expected value restrictions.


Proposition 5.26. (Explicit form of restricted reference priors).

Let $e$ be an experiment which provides information about $\theta$, and, for given $\{(g_i(\cdot), \beta_i),\ i = 1, \ldots, m\}$, let $Q$ be the class of prior distributions $p(\theta)$ of $\theta$ which satisfy

\[ \int_{\Theta} g_i(\theta)\,p(\theta)\, d\theta = \beta_i, \quad i = 1, \ldots, m. \]

Let $\pi(\theta)$ be an unrestricted reference prior for $\theta$ relative to $e$; then, a $Q$-reference prior of $\theta$ relative to $e$, if it exists, is of the form

\[ \pi^Q(\theta) \propto \pi(\theta) \exp\left\{ \sum_{i=1}^{m} \lambda_i g_i(\theta) \right\}, \]

where the $\lambda_i$'s are constants determined by the conditions which define $Q$.

Proof. The calculus of variations argument which underlay the derivation of reference priors may be extended to include the additional restrictions imposed by the definition of $Q$, thus leading us to seek an extremal of the functional

\[ \int p(\theta) \log \frac{f^*_k(\theta)}{p(\theta)}\, d\theta + \lambda \left\{ \int p(\theta)\, d\theta - 1 \right\} + \sum_{i=1}^{m} \lambda_i \left\{ \int g_i(\theta)\,p(\theta)\, d\theta - \beta_i \right\}, \]

corresponding to the assumption of a $k$-fold replicate of $e$. A standard argument now shows that the solution must satisfy

\[ \log f^*_k(\theta) - \log p(\theta) + \lambda + \sum_{i=1}^{m} \lambda_i g_i(\theta) = 0 \]

and hence that

\[ p(\theta) \propto f^*_k(\theta) \exp\left\{ \sum_{i=1}^{m} \lambda_i g_i(\theta) \right\}. \]


Taking $k \to \infty$, the result follows from Proposition 5.18.

Example 5.15. (Location models). Let $x = \{x_1, \ldots, x_n\}$ be a random sample from a location model $p(x \mid \theta) = h(x - \theta)$, $x \in \Re$, $\theta \in \Re$, and suppose that the prior mean and variance of $\theta$ are restricted to be $E[\theta] = \mu_0$, $V[\theta] = \sigma_0^2$. Under suitable regularity conditions, the asymptotic posterior distribution of $\theta$ will be of the form $p^*(\theta \mid x_1, \ldots, x_n) \propto f(\hat\theta_n - \theta)$, where $\hat\theta_n$ is an asymptotically sufficient, consistent estimator of $\theta$. Thus, by Proposition 5.23,

\[ \pi(\theta) \propto p^*(\theta \mid \hat\theta_n)\Big|_{\hat\theta_n = \theta} \propto f(0), \]

which is constant, so that the unrestricted reference prior will be uniform. It now follows from Proposition 5.26 that the restricted reference prior will be

\[ \pi^Q(\theta) \propto \exp\left\{ \lambda_1 \theta + \lambda_2 (\theta - \mu_0)^2 \right\}, \]

with $\int \theta\, \pi^Q(\theta)\, d\theta = \mu_0$ and $\int (\theta - \mu_0)^2\, \pi^Q(\theta)\, d\theta = \sigma_0^2$. Thus, the restricted reference prior is the normal distribution with the specified mean and variance.
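As a quick numerical illustration of Example 5.15, the sketch below checks that the choice $\lambda_1 = 0$, $\lambda_2 = -1/(2\sigma_0^2)$ satisfies the two moment restrictions. The helper `moments`, the grid settings and the values of `mu0` and `sigma0` are assumptions made for this illustration only.

```python
from math import exp


def moments(log_density, lo, hi, steps=20000):
    """Mean and variance of the density proportional to exp(log_density) on [lo, hi],
    computed with the midpoint rule."""
    width = (hi - lo) / steps
    norm = first = second = 0.0
    for k in range(steps):
        t = lo + (k + 0.5) * width
        w = exp(log_density(t))
        norm += w
        first += w * t
        second += w * t * t
    mean = first / norm
    return mean, second / norm - mean**2


mu0, sigma0 = 1.5, 2.0                       # illustrative prior mean and standard deviation
lam1, lam2 = 0.0, -1.0 / (2.0 * sigma0**2)   # candidate multipliers


def log_restricted_prior(theta):
    """Logarithm of exp{lam1 * theta + lam2 * (theta - mu0)^2}, up to a constant."""
    return lam1 * theta + lam2 * (theta - mu0) ** 2


mean, variance = moments(log_restricted_prior, mu0 - 12 * sigma0, mu0 + 12 * sigma0)
print(mean, variance)  # compare with mu0 and sigma0**2
```

The integration range of twelve standard deviations on either side is an arbitrary truncation; the neglected tail mass is far below floating-point resolution.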


5.4.4 Nuisance Parameters 


The development given thus far has assumed that $\theta$ was one-dimensional and that interest was centred on $\theta$ or on a one-to-one transformation of $\theta$. We shall next consider the case where $\theta$ is two-dimensional and interest centres on reporting inferences for a one-dimensional function, $\phi = \phi(\theta)$. Without loss of generality, we may rewrite the vector parameter in the form $\theta = (\phi, \lambda)$, $\phi \in \Phi$, $\lambda \in \Lambda$, where $\phi$ is the parameter of interest and $\lambda$ is a nuisance parameter. The problem is to identify a reference prior for $\theta$, when the decision problem is that of reporting marginal inferences for $\phi$, assuming a logarithmic score (utility) function.

To motivate our approach to this problem, consider $z_k$ to be the result of a $k$-fold replicate of the experiment which consists in obtaining a single observation, $x$, from $p(x \mid \theta) = p(x \mid \phi, \lambda)$. Recalling that $p(\theta)$ can be thought of in terms of the decomposition

\[ p(\theta) = p(\phi, \lambda) = p(\phi)\,p(\lambda \mid \phi), \]

suppose, for the moment, that a suitable reference form, $\pi(\lambda \mid \phi)$, for $p(\lambda \mid \phi)$ has been specified and that only $\pi(\phi)$ remains to be identified. Proposition 5.18 then implies that the "marginal reference prior" for $\phi$ is given by

\[ \pi(\phi) \propto \lim_{k\to\infty} \left[ f^*_k(\phi) / f^*_k(\phi_0) \right], \quad \phi, \phi_0 \in \Phi, \]

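where

\[ f^*_k(\phi) = \exp\left\{ \int p(z_k \mid \phi) \log p^*(\phi \mid z_k)\, dz_k \right\}, \]

$p^*(\phi \mid z_k)$ is an asymptotic approximation to the marginal posterior for $\phi$, and

\[ p(z_k \mid \phi) = \int p(z_k \mid \phi, \lambda)\,\pi(\lambda \mid \phi)\, d\lambda = \int \prod_{i=1}^{k} p(x_i \mid \phi, \lambda)\,\pi(\lambda \mid \phi)\, d\lambda. \]

By conditioning throughout on $\phi$, we see from Proposition 5.18 that the "conditional reference prior" for $\lambda$ given $\phi$ has the form

\[ \pi(\lambda \mid \phi) \propto \lim_{k\to\infty} \left[ \frac{f^*_k(\lambda \mid \phi)}{f^*_k(\lambda_0 \mid \phi)} \right], \quad \lambda, \lambda_0 \in \Lambda, \ \phi \in \Phi, \]

where

\[ f^*_k(\lambda \mid \phi) = \exp\left\{ \int p(z_k \mid \phi, \lambda) \log p^*(\lambda \mid \phi, z_k)\, dz_k \right\}, \]

$p^*(\lambda \mid \phi, z_k)$ is an asymptotic approximation to the conditional posterior for $\lambda$ given $\phi$, and

\[ p(z_k \mid \phi, \lambda) = \prod_{i=1}^{k} p(x_i \mid \phi, \lambda). \]

Given actual data $x$, the marginal reference posterior for $\phi$, corresponding to the reference prior

\[ \pi(\theta) = \pi(\phi, \lambda) = \pi(\phi)\,\pi(\lambda \mid \phi) \]

derived from the above procedure, would then be

\[ \pi(\phi \mid x) \propto \int \pi(\phi, \lambda \mid x)\, d\lambda \propto \pi(\phi) \int p(x \mid \phi, \lambda)\,\pi(\lambda \mid \phi)\, d\lambda. \]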


This would appear, then, to provide a straightforward approach to deriving reference 
analysis procedures in the presence of nuisance parameters. However, there is a 
major difficulty. 

In general, as we have already seen, reference priors are typically not proper 
probability densities. This means that the integrated form derived from $\pi(\lambda \mid \phi)$,

\[ p(z_k \mid \phi) = \int p(z_k \mid \phi, \lambda)\,\pi(\lambda \mid \phi)\, d\lambda, \]

which plays a key role in the above derivation of $\pi(\phi)$, will typically not be a proper probability model. The above approach will fail in such cases.

Clearly, a more subtle approach is required to overcome this technical problem. 
However, before turning to the details of such an approach, we present an example, 
involving finite parameter ranges, where the approach outlined above does produce 
an interesting solution.

Example 5.16. (Induction). Consider a large, finite dichotomised population, all of whose elements individually may or may not have a specified property. A random sample is taken without replacement from the population, the sample being large in absolute size, but still relatively small compared with the population size. All the elements sampled turn out to have the specified property. Many commentators have argued that, in view of the large absolute size of the sample, one should be led to believe quite strongly that all elements of the population have the property, irrespective of the fact that the population size is greater still, an argument related to Laplace's rule of succession. (See, for example, Wrinch and Jeffreys, 1921, Jeffreys, 1939/1961, pp. 128-132 and Geisser, 1980a.)

Let us denote the population size by $N$, the sample size by $n$, the observed number of elements having the property by $x$, and the actual number of elements in the population having the property by $\theta$. The probability model for the sampling mechanism is then the hypergeometric, which, for possible values of $x$, has the form

\[ p(x \mid \theta) = \frac{\binom{\theta}{x}\binom{N-\theta}{n-x}}{\binom{N}{n}}. \]

If $p(\theta = r)$, $r = 0, \ldots, N$, defines a prior distribution for $\theta$, the posterior probability that $\theta = N$, having observed $x = n$, is given by

\[ p(\theta = N \mid x = n) = \frac{p(x = n \mid \theta = N)\,p(\theta = N)}{\sum_{r=n}^{N} p(x = n \mid \theta = r)\,p(\theta = r)}. \]

Suppose we considered $\theta$ to be the parameter of interest, and wished to provide a reference analysis. Then, since the set of possible values for $\theta$ is finite, Proposition 5.19 implies that

\[ p(\theta = r) = \frac{1}{N+1}, \quad r = 0, 1, \ldots, N, \]

is a reference prior. Straightforward calculation then establishes that

\[ p(\theta = N \mid x = n) = \frac{n+1}{N+1}, \]

which is not close to unity when $n$ is large but $n/N$ is small.

However, careful consideration of the problem suggests that it is not $\theta$ which is the parameter of interest: rather it is the parameter

\[ \phi = \begin{cases} 1 & \text{if } \theta = N \\ 0 & \text{if } \theta \ne N. \end{cases} \]

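To obtain a representation of $\theta$ in the form $(\phi, \lambda)$, let us define

\[ \lambda = \begin{cases} 1 & \text{if } \theta = N \\ \theta & \text{if } \theta \ne N. \end{cases} \]

By Proposition 5.19, the reference priors $\pi(\phi)$ and $\pi(\lambda \mid \phi)$ are both uniform over the appropriate ranges, and are given by

\[ \pi(\phi = 0) = \pi(\phi = 1) = \tfrac12, \]

\[ \pi(\lambda = 1 \mid \phi = 1) = 1, \qquad \pi(\lambda = r \mid \phi = 0) = \frac{1}{N}, \quad r = 0, 1, \ldots, N-1. \]

These imply a reference prior for $\theta$ of the form

\[ p(\theta) = \begin{cases} \dfrac12 & \text{if } \theta = N \\[1ex] \dfrac{1}{2N} & \text{if } \theta \ne N \end{cases} \]

and straightforward calculation establishes that

\[ p(\theta = N \mid x = n) = \left[ 1 + \frac{1}{n+1}\left( 1 - \frac{n}{N} \right) \right]^{-1} \approx \frac{n+1}{n+2}, \]

which clearly displays the irrelevance of the sampling fraction and the approach to unity for large $n$ (see Bernardo, 1985b, for further discussion).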
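The two posterior probabilities in Example 5.16 can be verified by direct enumeration. The following is a small sketch using exact rational arithmetic; the function names and the population and sample sizes are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction
from math import comb


def prob_all_sampled_have_property(theta, N, n):
    """Hypergeometric p(x = n | theta): all n sampled elements have the property."""
    return Fraction(comb(theta, n), comb(N, n))  # comb(theta, n) is 0 when theta < n


def posterior_whole_population(prior, N, n):
    """p(theta = N | x = n) for a prior given as a list indexed by theta = 0, ..., N."""
    weights = [prob_all_sampled_have_property(t, N, n) * prior[t] for t in range(N + 1)]
    return weights[N] / sum(weights)


def uniform_prior(N):
    """Reference prior when theta itself is the parameter of interest."""
    return [Fraction(1, N + 1)] * (N + 1)


def reference_prior_for_phi(N):
    """Prior implied by pi(phi) and pi(lambda | phi): 1/(2N) for theta < N, 1/2 for theta = N."""
    return [Fraction(1, 2 * N)] * N + [Fraction(1, 2)]


N, n = 1000, 50  # illustrative population and sample sizes
print(float(posterior_whole_population(uniform_prior(N), N, n)))            # equals (n + 1)/(N + 1)
print(float(posterior_whole_population(reference_prior_for_phi(N), N, n)))  # close to (n + 1)/(n + 2)
```

Using `Fraction` avoids rounding in the binomial coefficients, so the closed forms in the example can be compared for exact equality rather than to a tolerance.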


We return now to the general problem of defining a reference prior for $\theta = (\phi, \lambda)$, $\phi \in \Phi$, $\lambda \in \Lambda$, where $\phi$ is the parameter vector of interest and $\lambda$ is a nuisance parameter. We shall refer to the pair $(\phi, \lambda)$ as an ordered parametrisation of the model. We recall that the problem arises because in order to obtain the marginal reference prior $\pi(\phi)$ for the first parameter we need to work with the integrated model

\[ p(z_k \mid \phi) = \int p(z_k \mid \phi, \lambda)\,\pi(\lambda \mid \phi)\, d\lambda. \]

However, this will only be a proper model if the conditional prior $\pi(\lambda \mid \phi)$ for the second parameter is a proper probability density and, typically, this will not be the case.

This suggests the following strategy: identify an increasing sequence $\{\Lambda_i\}$ of subsets of $\Lambda$, $\bigcup_i \Lambda_i = \Lambda$, which may depend on $\phi$, such that, on each $\Lambda_i$, the conditional reference prior, $\pi(\lambda \mid \phi)$ restricted to $\Lambda_i$ can be normalised to give a reference prior, $\pi_i(\lambda \mid \phi)$, which is proper. For each $i$, a proper integrated model can then be obtained and a marginal reference prior $\pi_i(\phi)$ identified. The required reference prior $\pi(\phi, \lambda)$ is then obtained by taking the limit as $i \to \infty$. The strategy clearly requires a choice of the $\Lambda_i$'s to be made, but in any specific problem a "natural" sequence usually suggests itself. We formalise this procedure in the next definition.




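Definition 5.9. (Reference distributions given a nuisance parameter).

Let $x$ be the result of an experiment $e$ which consists of one observation from the probability model $p(x \mid \phi, \lambda)$, $x \in X$, $(\phi, \lambda) \in \Phi \times \Lambda \subseteq \Re \times \Re$. The reference posterior, $\pi(\phi \mid x)$, for the parameter of interest $\phi$, relative to the experiment $e$ and to the increasing sequences of subsets of $\Lambda$, $\{\Lambda_i(\phi)\}$, $\phi \in \Phi$, $\bigcup_i \Lambda_i(\phi) = \Lambda$, is defined to be the result of the following procedure:

(i) applying Definition 5.7 to the model $p(x \mid \phi, \lambda)$, for fixed $\phi$, obtain the conditional reference prior, $\pi(\lambda \mid \phi)$, for $\lambda$;

(ii) for each $\phi$, normalise $\pi(\lambda \mid \phi)$ within each $\Lambda_i(\phi)$ to obtain a sequence of proper priors, $\pi_i(\lambda \mid \phi)$;

(iii) use these to obtain a sequence of integrated models

\[ p_i(x \mid \phi) = \int_{\Lambda_i(\phi)} p(x \mid \phi, \lambda)\,\pi_i(\lambda \mid \phi)\, d\lambda; \]

(iv) use those to derive the sequence of reference priors

\[ \pi_i(\phi) = c \lim_{k\to\infty} \frac{f^*_k(\phi)}{f^*_k(\phi_0)}, \qquad f^*_k(\phi) = \exp\left\{ \int p_i(z_k \mid \phi) \log p^*(\phi \mid z_k)\, dz_k \right\}, \]

and, for data $x$, obtain the corresponding reference posteriors

\[ \pi_i(\phi \mid x) \propto \pi_i(\phi) \int_{\Lambda_i(\phi)} p(x \mid \phi, \lambda)\,\pi_i(\lambda \mid \phi)\, d\lambda; \]

(v) define $\pi(\phi \mid x)$ such that, for almost all $x$,

\[ \lim_{i\to\infty} \int \pi_i(\phi \mid x) \log \frac{\pi_i(\phi \mid x)}{\pi(\phi \mid x)}\, d\phi = 0. \]

The reference prior, relative to the ordered parametrisation $(\phi, \lambda)$, is any positive function $\pi(\phi, \lambda)$, such that

\[ \pi(\phi \mid x) \propto \int p(x \mid \phi, \lambda)\,\pi(\phi, \lambda)\, d\lambda. \]

This will typically be simply obtained as

\[ \pi(\phi, \lambda) = \lim_{i\to\infty} \frac{\pi_i(\phi)\,\pi_i(\lambda \mid \phi)}{\pi_i(\phi_0)\,\pi_i(\lambda_0 \mid \phi_0)}. \]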


Ghosh and Mukerjee (1992) showed that, in effect, the reference prior thus defined maximises the missing information about the parameter of interest, $\phi$, subject to the condition that, given $\phi$, the missing information about the nuisance parameter, $\lambda$, is maximised.

In a model involving a parameter of interest and a nuisance parameter, the form chosen for the latter is, of course, arbitrary. Thus, $p(x \mid \phi, \lambda)$ can be written alternatively as $p(x \mid \phi, \psi)$, for any $\psi = \psi(\phi, \lambda)$ for which the transformation $(\phi, \lambda) \to (\phi, \psi)$ is one-to-one. Intuitively, we would hope that the reference posterior for $\phi$ derived according to Definition 5.9 would not depend on the particular form chosen for the nuisance parameters. The following proposition establishes that this is the case.


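Proposition 5.27. (Invariance with respect to the choice of the nuisance parameter). Let $e$ be an experiment which consists in obtaining one observation from $p(x \mid \phi, \lambda)$, $(\phi, \lambda) \in \Phi \times \Lambda \subseteq \Re \times \Re$, and let $e'$ be an experiment which consists in obtaining one observation from $p(x \mid \phi, \psi)$, $(\phi, \psi) \in \Phi \times \Psi \subseteq \Re \times \Re$, where $(\phi, \lambda) \to (\phi, \psi)$ is a one-to-one transformation, with $\psi = g_\phi(\lambda)$. Then, the reference posteriors for $\phi$, relative to $[e, \{\Lambda_i(\phi)\}]$ and $[e', \{\Psi_i(\phi)\}]$, where $\Psi_i(\phi) = g_\phi\{\Lambda_i(\phi)\}$, are identical.

Proof. By Proposition 5.22, for given $\phi$,

\[ \pi_\psi(\psi \mid \phi) = \pi_\lambda(g_\phi^{-1}(\psi) \mid \phi)\, | J_{g_\phi^{-1}}(\psi) |, \]

where

\[ \psi = g_\phi(\lambda), \qquad J_{g_\phi^{-1}}(\psi) = \frac{\partial g_\phi^{-1}(\psi)}{\partial \psi}. \]

Hence, if we define

\[ \Psi_i(\phi) = \{ \psi;\ \psi = g_\phi(\lambda),\ \lambda \in \Lambda_i(\phi) \} \]

and normalise $\pi_\psi(\psi \mid \phi)$ over $\Psi_i(\phi)$ and $\pi_\lambda(g_\phi^{-1}(\psi) \mid \phi)$ over $\Lambda_i(\phi)$, we see that the normalised forms are consistently related by the appropriate Jacobian element. If we denote these normalised forms, for simplicity, by $\pi_i(\lambda \mid \phi)$, $\pi_i(\psi \mid \phi)$, we see that, for the integrated models used in steps (iii) and (iv) of Definition 5.9,

\[ p_i(x \mid \phi) = \int_{\Lambda_i(\phi)} p(x \mid \phi, \lambda)\,\pi_i(\lambda \mid \phi)\, d\lambda = \int_{\Psi_i(\phi)} p(x \mid \phi, \psi)\,\pi_i(\psi \mid \phi)\, d\psi, \]

and hence that the procedure will lead to identical forms of $\pi(\phi \mid x)$.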




Alternatively, we may wish to consider retaining the same form of nuisance parameter, $\lambda$, but redefining the parameter of interest to be a one-to-one function of $\phi$. Thus, $p(x \mid \phi, \lambda)$ might be written as $p(x \mid \gamma, \lambda)$, where $\gamma = g(\phi)$ is now the parameter vector of interest. Intuitively, we would hope that the reference posterior for $\gamma$ would be consistently related to that of $\phi$ by means of the appropriate Jacobian element. The next proposition establishes that this is the case.


Proposition 5.28. (Invariance under one-to-one transformations).

Let $e$ be an experiment which consists in obtaining one observation from $p(x \mid \phi, \lambda)$, $\phi \in \Phi$, $\lambda \in \Lambda$, and let $e'$ be an experiment which consists in obtaining one observation from $p(x \mid \gamma, \lambda)$, $\gamma \in \Gamma$, $\lambda \in \Lambda$, where $\gamma = g(\phi)$. Then, given data $x$, the reference posteriors for $\phi$ and $\gamma$, relative to $[e, \{\Lambda_i(\phi)\}]$ and $[e', \{\Phi_i(\gamma)\}]$, $\Phi_i(\gamma) = \Lambda_i\{g(\phi)\}$, are related by:

(i) $\pi_\gamma(\gamma \mid x) = \pi_\phi(g^{-1}(\gamma) \mid x)$, if $\Phi$ is discrete;

(ii) $\pi_\gamma(\gamma \mid x) = \pi_\phi(g^{-1}(\gamma) \mid x)\, | J_{g^{-1}}(\gamma) |$, if $J_{g^{-1}}(\gamma) = \dfrac{\partial g^{-1}(\gamma)}{\partial \gamma}$ exists.

Proof. In all cases, step (i) of Definition 5.9 clearly results in a conditional reference prior $\pi(\lambda \mid \phi) = \pi(\lambda \mid g^{-1}(\gamma))$. For discrete $\Phi$, $\pi_i(\phi)$ and $\pi_i(\gamma)$ defined by steps (ii)-(iv) of Definition 5.9 are both uniform distributions, by Proposition 5.18, and the result follows straightforwardly. If $J_{g^{-1}}(\gamma)$ exists, $\pi_i(\phi)$ and $\pi_i(\gamma)$ defined by steps (ii)-(iv) of Definition 5.9 are related by the claimed Jacobian element, $| J_{g^{-1}}(\gamma) |$, by Proposition 5.22, and the result follows immediately.


In Proposition 5.23, we saw that the identification of explicit forms of reference prior can be greatly simplified if the approximate asymptotic posterior distribution is of the form

\[ p^*(\theta \mid z_k) = p^*(\theta \mid \hat\theta_k), \]

where $\hat\theta_k$ is an asymptotically sufficient, consistent estimate of $\theta$. Proposition 5.24 establishes that even greater simplification results when the asymptotic distribution is normal. We shall now extend this to the nuisance parameter case.


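Proposition 5.29. (Bivariate reference priors under asymptotic normality).

Let $e_n$ be the experiment which consists of the observation of a random sample $x_1, \ldots, x_n$ from $p(x \mid \phi, \lambda)$, $(\phi, \lambda) \in \Phi \times \Lambda \subseteq \Re \times \Re$, and let $\{\Lambda_i(\phi)\}$ be suitably defined sequences of subsets of $\Lambda$, as required by Definition 5.9. Suppose that the joint asymptotic posterior distribution of $(\phi, \lambda)$, given a $k$-fold replicate of $e_n$, is multivariate normal with precision matrix $knH(\hat\phi_{kn}, \hat\lambda_{kn})$, where $(\hat\phi_{kn}, \hat\lambda_{kn})$ is a consistent estimate of $(\phi, \lambda)$ and suppose that $\hat h_{ij} = h_{ij}(\hat\phi_{kn}, \hat\lambda_{kn})$, $i = 1, 2$, $j = 1, 2$, is the partition of $H$ corresponding to $\phi$, $\lambda$. Then

\[ \pi(\lambda \mid \phi) \propto \{h_{22}(\phi, \lambda)\}^{1/2}; \]

\[ \pi(\phi, \lambda) = \pi(\lambda \mid \phi) \lim_{i\to\infty} \frac{\pi_i(\phi)\,c_i(\phi)}{\pi_i(\phi_0)\,c_i(\phi_0)}, \quad \phi_0 \in \Phi, \]

define a reference prior relative to the ordered parametrisation $(\phi, \lambda)$, where

\[ \pi_i(\phi) \propto \exp\left\{ \int_{\Lambda_i(\phi)} \pi_i(\lambda \mid \phi) \log\left( \{h_\phi(\phi, \lambda)\}^{1/2} \right) d\lambda \right\}, \]

with

\[ \pi_i(\lambda \mid \phi) = c_i(\phi)\,\pi(\lambda \mid \phi) = \frac{\pi(\lambda \mid \phi)}{\int_{\Lambda_i(\phi)} \pi(\lambda \mid \phi)\, d\lambda}, \]

and

\[ h_\phi = (h_{11} - h_{12} h_{22}^{-1} h_{21}). \]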


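Proof. Given $\phi$, the asymptotic conditional distribution of $\lambda$ is normal with precision $kn\,h_{22}(\hat\phi_{kn}, \hat\lambda_{kn})$. The first part of Proposition 5.29 then follows from Proposition 5.24.

Marginally, the asymptotic distribution of $\phi$ is univariate normal with precision $kn\,\hat h_\phi$, where $h_\phi = (h_{11} - h_{12} h_{22}^{-1} h_{21})$. To derive the form of $\pi_i(\phi)$, we note that if $z_k \in Z$ denotes the result of a $k$-fold replication of $e_n$,

\[ f^*_k(\phi) = \exp\left\{ \int_Z p_i(z_k \mid \phi) \log p^*(\phi \mid z_k)\, dz_k \right\}, \]

where, with $\pi_i(\lambda \mid \phi)$ denoting the normalised version of $\pi(\lambda \mid \phi)$ over $\Lambda_i(\phi)$, the integrand has the form

\[ \int_Z \left[ \int_{\Lambda_i(\phi)} p(z_k \mid \phi, \lambda)\,\pi_i(\lambda \mid \phi)\, d\lambda \right] \log N(\phi \mid \hat\phi_{kn}, kn\,\hat h_\phi)\, dz_k \]

\[ = \int_{\Lambda_i(\phi)} \pi_i(\lambda \mid \phi) \left[ \int_Z p(z_k \mid \phi, \lambda) \log N(\phi \mid \hat\phi_{kn}, kn\,\hat h_\phi)\, dz_k \right] d\lambda \]

\[ \approx \int_{\Lambda_i(\phi)} \pi_i(\lambda \mid \phi) \log\left[ (2\pi)^{-1/2}(kn)^{1/2}\{h_\phi(\phi, \lambda)\}^{1/2} \right] d\lambda, \]

for large $k$, so that

\[ \pi_i(\phi) = \lim_{k\to\infty} \frac{f^*_k(\phi)}{f^*_k(\phi_0)} \]

has the stated form. Since, for data $x$, the reference prior $\pi(\phi, \lambda)$ is defined by

\[ \pi(\phi \mid x) = \lim_{i\to\infty} \pi_i(\phi \mid x) \propto \lim_{i\to\infty} p_i(x \mid \phi)\,\pi_i(\phi) \]

\[ \propto \lim_{i\to\infty} \pi_i(\phi) \int_{\Lambda_i(\phi)} p(x \mid \phi, \lambda)\,c_i(\phi)\,\pi(\lambda \mid \phi)\, d\lambda \]

\[ \propto \int p(x \mid \phi, \lambda)\,\pi(\phi, \lambda)\, d\lambda, \]

the result follows.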
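In many cases, the forms of $\{h_{22}(\phi, \lambda)\}$ and $\{h_\phi(\phi, \lambda)\}$ factorise into products of separate functions of $\phi$ and $\lambda$, and the subsets $\{\Lambda_i\}$ do not depend on $\phi$. In such cases, the reference prior takes on a very simple form.

Corollary. Suppose that, under the conditions of Proposition 5.29, we choose a suitable increasing sequence of subsets $\{\Lambda_i\}$ of $\Lambda$, which do not depend on $\phi$, and suppose also that

\[ \{h_\phi(\phi, \lambda)\}^{1/2} = f_1(\phi)\,g_1(\lambda), \qquad \{h_{22}(\phi, \lambda)\}^{1/2} = f_2(\phi)\,g_2(\lambda). \]

Then a reference prior relative to the ordered parametrisation $(\phi, \lambda)$ is

\[ \pi(\phi, \lambda) \propto f_1(\phi)\,g_2(\lambda). \]

Proof. By Proposition 5.29, $\pi(\lambda \mid \phi) \propto f_2(\phi)\,g_2(\lambda)$, and hence

\[ \pi_i(\lambda \mid \phi) = a_i\,g_2(\lambda), \]

where $a_i^{-1} = \int_{\Lambda_i} g_2(\lambda)\, d\lambda$. It then follows that

\[ \pi_i(\phi) \propto \exp\left\{ \int_{\Lambda_i} a_i\,g_2(\lambda) \log[f_1(\phi)\,g_1(\lambda)]\, d\lambda \right\} = e^{b_i} f_1(\phi) \propto f_1(\phi), \]

where $b_i = \int_{\Lambda_i} a_i\,g_2(\lambda) \log g_1(\lambda)\, d\lambda$, and the result easily follows.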


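Example 5.17. (Normal mean and standard deviation). Let $e_n$ be the experiment which consists in the observation of a random sample $x = \{x_1, \ldots, x_n\}$ from a normal distribution, with both mean, $\mu$, and standard deviation, $\sigma$, unknown. We shall first obtain a reference analysis for $\mu$, taking $\sigma$ to be the nuisance parameter.

Since the distribution belongs to the exponential family, asymptotic normality obtains and the results of Proposition 5.29 can be applied. We therefore first obtain the Fisher (expected) information matrix, whose elements we recall are given by

\[ h_{ij}(\mu, \sigma) = \int N(x \mid \mu, \sigma^{-2}) \left\{ -\frac{\partial^2 \log N(x \mid \mu, \sigma^{-2})}{\partial\theta_i\, \partial\theta_j} \right\} dx, \]

from which it is easily verified that the asymptotic precision matrix as a function of $\theta = (\mu, \sigma)$ is given by

\[ H_\theta(\mu, \sigma) = \begin{pmatrix} \sigma^{-2} & 0 \\ 0 & 2\sigma^{-2} \end{pmatrix}, \]

\[ \{h_\mu(\mu, \sigma)\}^{1/2} = \sigma^{-1}, \qquad \{h_{22}(\mu, \sigma)\}^{1/2} = \sqrt{2}\,\sigma^{-1}. \]

This implies that

\[ \pi(\sigma \mid \mu) \propto \{h_{22}(\mu, \sigma)\}^{1/2} \propto \sigma^{-1}, \]

so that, for example, $\Lambda_i = \{\sigma;\ e^{-i} < \sigma < e^{i}\}$, $i = 1, 2, \ldots$, provides a suitable sequence of subsets of $\Lambda = \Re^+$ not depending on $\mu$, over which $\pi(\sigma \mid \mu)$ can be normalised and the corollary to Proposition 5.29 can be applied. It follows that

\[ \pi(\mu, \sigma) = \pi(\mu)\,\pi(\sigma \mid \mu) \propto 1 \times \sigma^{-1} \]

provides a reference prior relative to the ordered parametrisation $(\mu, \sigma)$. The corresponding reference posterior for $\mu$, given $x$, is

\[ \pi(\mu \mid x) \propto \int p(x \mid \mu, \sigma)\,\pi(\mu, \sigma)\, d\sigma \]

\[ \propto \pi(\mu) \int \prod_{i=1}^{n} N(x_i \mid \mu, \sigma^{-2})\,\pi(\sigma \mid \mu)\, d\sigma \]

\[ \propto \int \sigma^{-n} \exp\left\{ -\frac{n}{2\sigma^2}\left[ (\bar x - \mu)^2 + s^2 \right] \right\} \sigma^{-1}\, d\sigma \]

\[ \propto \int \lambda^{n/2 - 1} \exp\left\{ -\frac{n\lambda}{2}\left[ (\bar x - \mu)^2 + s^2 \right] \right\} d\lambda \]

\[ \propto \left[ s^2 + (\mu - \bar x)^2 \right]^{-n/2} \]

\[ = \mathrm{St}(\mu \mid \bar x, (n-1)s^{-2}, n-1), \]

where $ns^2 = \sum (x_i - \bar x)^2$.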
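If we now reverse the roles of $\mu$ and $\sigma$, so that the latter is now the parameter of interest and $\mu$ is the nuisance parameter, we obtain, writing $\phi = (\sigma, \mu)$,

\[ H_\phi(\sigma, \mu) = \begin{pmatrix} 2\sigma^{-2} & 0 \\ 0 & \sigma^{-2} \end{pmatrix}, \]

so that $\{h_\sigma(\sigma, \mu)\}^{1/2} = \sqrt{2}\,\sigma^{-1}$, $\{h_{22}(\sigma, \mu)\}^{1/2} = \sigma^{-1}$ and, by a similar analysis to the above,

\[ \pi(\mu \mid \sigma) \propto \sigma^{-1} \]

so that, for example, $\Lambda_i = \{\mu;\ -e^{i} < \mu < e^{i}\}$, $i = 1, 2, \ldots$, provides a suitable sequence of subsets of $\Lambda = \Re$ not depending on $\sigma$, over which $\pi(\mu \mid \sigma)$ can be normalised and the corollary to Proposition 5.29 can be applied. It follows that

\[ \pi(\mu, \sigma) = \pi(\sigma)\,\pi(\mu \mid \sigma) \propto 1 \times \sigma^{-1} \]

provides a reference prior relative to the ordered parametrisation $(\sigma, \mu)$. The corresponding reference posterior for $\sigma$, given $x$, is

\[ \pi(\sigma \mid x) \propto \int p(x \mid \mu, \sigma)\,\pi(\mu, \sigma)\, d\mu \propto \pi(\sigma) \int \prod_{i=1}^{n} N(x_i \mid \mu, \sigma^{-2})\,\pi(\mu \mid \sigma)\, d\mu, \]

the right-hand side of which can be written in the form

\[ \sigma^{-n} \exp\left\{ -\frac{ns^2}{2\sigma^2} \right\} \int \sigma^{-1} \exp\left\{ -\frac{n}{2\sigma^2}(\mu - \bar x)^2 \right\} d\mu. \]

Noting, by comparison with a $N(\mu \mid \bar x, n\lambda)$ density, that the integral is a constant, and changing the variable to $\lambda = \sigma^{-2}$, implies that

\[ \pi(\lambda \mid x) \propto \lambda^{(n-1)/2 - 1} \exp\left\{ -\tfrac12 ns^2\lambda \right\} = \mathrm{Ga}\left( \lambda \mid \tfrac12(n-1), \tfrac12 ns^2 \right), \]

or, alternatively,

\[ \pi(\lambda ns^2 \mid x) = \mathrm{Ga}\left( \lambda ns^2 \mid \tfrac12(n-1), \tfrac12 \right) = \chi^2(\lambda ns^2 \mid n-1). \]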
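The marginalisation over $\sigma$ in Example 5.17 can be checked numerically. The sketch below integrates the joint kernel over $u = \log\sigma$ with the midpoint rule and compares the ratio of the result at two values of $\mu$ with the ratio implied by the closed form $[s^2 + (\mu - \bar x)^2]^{-n/2}$. The data values, the integration range and the function names are assumptions chosen for illustration.

```python
from math import exp


def sample_summaries(data):
    """Return n, the sample mean and s^2, with n * s^2 = sum of squared deviations."""
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((x - xbar) ** 2 for x in data) / n
    return n, xbar, s2


def marginal_kernel_numeric(mu, data, lo=-10.0, hi=10.0, steps=20000):
    """Integrate sigma^-(n+1) * exp{-n[(xbar - mu)^2 + s^2] / (2 sigma^2)} over sigma > 0.
    With u = log(sigma) the integrand becomes exp{-n u - 0.5 n a exp(-2u)}."""
    n, xbar, s2 = sample_summaries(data)
    a = (xbar - mu) ** 2 + s2
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        u = lo + (k + 0.5) * width
        total += exp(-n * u - 0.5 * n * a * exp(-2.0 * u))
    return total * width


def marginal_kernel_closed_form(mu, data):
    """Kernel of the Student reference posterior for mu."""
    n, xbar, s2 = sample_summaries(data)
    return (s2 + (mu - xbar) ** 2) ** (-n / 2.0)


data = [4.1, 5.3, 3.8, 4.9, 5.6]  # illustrative observations
numeric_ratio = marginal_kernel_numeric(4.0, data) / marginal_kernel_numeric(5.5, data)
closed_ratio = marginal_kernel_closed_form(4.0, data) / marginal_kernel_closed_form(5.5, data)
print(numeric_ratio, closed_ratio)
```

Ratios are compared, rather than values, because the closed form is only defined up to a constant of proportionality.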


One feature of the above example is that the reference prior did not, in fact, 
depend on which of the parameters was taken to be the parameter of interest. In the 
following example the form does change when the parameter of interest changes. 


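Example 5.18. (Standardised normal mean). We consider the same situation as that of Example 5.17, but we now take $\phi = \mu/\sigma$ to be the parameter of interest. If $\sigma$ is taken as the nuisance parameter (by Proposition 5.27 the choice is irrelevant), $\psi = (\phi, \sigma) = g(\mu, \sigma)$ is clearly a one-to-one transformation, with

\[ J_{g^{-1}}(\psi) = \begin{pmatrix} \dfrac{\partial\mu}{\partial\phi} & \dfrac{\partial\mu}{\partial\sigma} \\[1.5ex] \dfrac{\partial\sigma}{\partial\phi} & \dfrac{\partial\sigma}{\partial\sigma} \end{pmatrix} = \begin{pmatrix} \sigma & \phi \\ 0 & 1 \end{pmatrix} \]

and using Corollary 1 to Proposition 5.17,

\[ H_\psi(\psi) = J^{t}_{g^{-1}}(\psi)\, H_\theta(g^{-1}(\psi))\, J_{g^{-1}}(\psi) = \begin{pmatrix} 1 & \phi\sigma^{-1} \\ \phi\sigma^{-1} & \sigma^{-2}(2 + \phi^2) \end{pmatrix}. \]

Again, the sequence $\Lambda_i = \{\sigma;\ e^{-i} < \sigma < e^{i}\}$, $i = 1, 2, \ldots$, provides a reasonable basis for applying the corollary to Proposition 5.29. It is easily seen that

\[ | h_\phi(\phi, \sigma) |^{1/2} = \frac{| H_\psi(\psi) |^{1/2}}{| h_{22}(\phi, \sigma) |^{1/2}} \propto (2 + \phi^2)^{-1/2}, \]

\[ | h_{22}(\phi, \sigma) |^{1/2} \propto (2 + \phi^2)^{1/2}\sigma^{-1}, \]

so that the reference prior relative to the ordered parametrisation $(\phi, \sigma)$ is given by

\[ \pi(\phi, \sigma) \propto (2 + \phi^2)^{-1/2}\sigma^{-1}. \]

In the $(\mu, \sigma)$ parametrisation this corresponds to

\[ \pi(\mu, \sigma) \propto \left( 2 + \frac{\mu^2}{\sigma^2} \right)^{-1/2} \sigma^{-2}, \]

which is clearly different from the form obtained in Example 5.17. Further discussion of this example will be provided in Example 5.26 of Section 5.6.2.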
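The matrix algebra in Example 5.18 is easy to confirm numerically. The following minimal sketch builds $J^t H_\theta J$ for the transformation $\mu = \phi\sigma$ and evaluates $h_\phi = |H_\psi| / h_{22}$; the helper names and the test point are illustrative assumptions.

```python
def transpose(a):
    """Transpose of a 2 x 2 matrix stored as nested lists."""
    return [[a[0][0], a[1][0]], [a[0][1], a[1][1]]]


def matmul(a, b):
    """Product of two 2 x 2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def information_phi_sigma(phi, sigma):
    """H_psi = J^t H_theta J for psi = (phi, sigma), with mu = phi * sigma."""
    jac = [[sigma, phi], [0.0, 1.0]]                      # d(mu, sigma) / d(phi, sigma)
    h_theta = [[sigma**-2, 0.0], [0.0, 2.0 * sigma**-2]]  # information in (mu, sigma)
    return matmul(transpose(jac), matmul(h_theta, jac))


def h_phi(phi, sigma):
    """h_phi = h_11 - h_12 h_22^{-1} h_21, computed as det(H_psi) / h_22."""
    h = information_phi_sigma(phi, sigma)
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return det / h[1][1]


phi, sigma = 0.7, 1.3  # illustrative point
print(information_phi_sigma(phi, sigma))
print(h_phi(phi, sigma), 2.0 / (2.0 + phi**2))
```

The second printed line shows that $h_\phi$ depends on $\phi$ alone, which is what allows the corollary to Proposition 5.29 to be applied.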


We conclude this subsection by considering a rather more involved example,
where a natural choice of the required $\Lambda_i(\phi)$ subsequence does depend on $\phi$. In
this case, we use Proposition 5.29, since its corollary does not apply.


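Example 5.19. (Product of normal means). Consider the case where independent random samples $x = \{x_1, \ldots, x_n\}$ and $y = \{y_1, \ldots, y_m\}$ are to be taken, respectively, from $N(x \mid \alpha, 1)$ and $N(y \mid \beta, 1)$, $\alpha > 0$, $\beta > 0$, so that the complete parametric model is

\[ p(x, y \mid \alpha, \beta) = \prod_{i=1}^{n} N(x_i \mid \alpha, 1) \prod_{j=1}^{m} N(y_j \mid \beta, 1), \]

for which, writing $\theta = (\alpha, \beta)$, the Fisher information matrix is easily seen to be

\[ H_\theta(\theta) = H(\alpha, \beta) = \begin{pmatrix} n & 0 \\ 0 & m \end{pmatrix}. \]

Suppose now that we make the one-to-one transformation $\psi = (\phi, \lambda) = (\alpha\beta, \alpha/\beta) = g(\alpha, \beta) = g(\theta)$, so that $\phi = \alpha\beta$ is taken to be the parameter of interest and $\lambda = \alpha/\beta$ is taken to be the nuisance parameter. Such a parameter of interest arises, for example, when inference about the area of a rectangle is required from data consisting of measurements of its sides.

The Jacobian of the inverse transformation is given by

\[ J_{g^{-1}}(\psi) = \begin{pmatrix} \dfrac{\partial\alpha}{\partial\phi} & \dfrac{\partial\alpha}{\partial\lambda} \\[1.5ex] \dfrac{\partial\beta}{\partial\phi} & \dfrac{\partial\beta}{\partial\lambda} \end{pmatrix} = \frac12 \begin{pmatrix} \left( \dfrac{\lambda}{\phi} \right)^{1/2} & \left( \dfrac{\phi}{\lambda} \right)^{1/2} \\[1.5ex] \left( \dfrac{1}{\phi\lambda} \right)^{1/2} & -\left( \dfrac{\phi}{\lambda^3} \right)^{1/2} \end{pmatrix} \]

and hence, using Corollary 1 to Proposition 5.17,

\[ H_\psi(\psi) = J^{t}_{g^{-1}}(\psi)\, H_\theta(g^{-1}(\psi))\, J_{g^{-1}}(\psi) = \frac{nm}{4} \begin{pmatrix} \dfrac{1}{\phi}\left( \dfrac{\lambda}{m} + \dfrac{1}{n\lambda} \right) & \left( \dfrac{1}{m} - \dfrac{1}{n\lambda^2} \right) \\[1.5ex] \left( \dfrac{1}{m} - \dfrac{1}{n\lambda^2} \right) & \dfrac{\phi}{\lambda^2}\left( \dfrac{\lambda}{m} + \dfrac{1}{n\lambda} \right) \end{pmatrix}, \]

with $| H_\psi(\psi) | = \dfrac{nm}{4\lambda^2}$, so that

\[ \pi(\lambda \mid \phi) \propto \{h_{22}(\phi, \lambda)\}^{1/2} \propto (nm\phi)^{1/2}\, \lambda^{-1} \left( \frac{\lambda}{m} + \frac{1}{n\lambda} \right)^{1/2}. \]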

The question now arises as to what constitutes a "natural" sequence $\{\Lambda_i(\phi)\}$, over which to define the normalised $\pi_i(\lambda \mid \phi)$ required by Definition 5.9. A natural increasing sequence of subsets of the original parameter space, $\Re^+ \times \Re^+$, for $(\alpha, \beta)$ would be the sets

\[ \Theta_i = \{ (\alpha, \beta);\ 0 < \alpha < i,\ 0 < \beta < i \}, \quad i = 1, 2, \ldots, \]

which transform, in the space of $\lambda \in \Lambda$, into the sequence

\[ \Lambda_i(\phi) = \left\{ \lambda;\ \frac{\phi}{i^2} < \lambda < \frac{i^2}{\phi} \right\}. \]

We note that, unlike in the previous cases we have considered, this does depend on $\phi$.
To complete the analysis, it can be shown, after some manipulation, that, for large $i$,


2 


' 
= nan. H2y\-1 i Ji 7 
a it f/m + Jay” ‘ (+35) 


and 
vnm a Le, A 1 ) “e 
fees Be fhe oe ghaes IX. 
Te i(Vm + St) dau) (me are Aes rey : 


which leads to a reference prior relative to the ordered parametrisation $(\phi, \lambda)$ given by

\[ \pi(\phi, \lambda) \propto \phi^{1/2}\, \lambda^{-1} \left( \frac{\lambda}{m} + \frac{1}{n\lambda} \right)^{1/2}. \]

In the original parametrisation, this corresponds to

\[ \pi(\alpha, \beta) \propto (n\alpha^2 + m\beta^2)^{1/2}, \]

which depends on the sample sizes through the ratio $m/n$ and reduces, in the case $n = m$, to $\pi(\alpha, \beta) \propto (\alpha^2 + \beta^2)^{1/2}$, a form originally proposed for this problem in an unpublished 1982 Stanford University technical report by Stein, who showed that it provides approximate agreement between Bayesian credible regions and classical confidence intervals for $\phi$. For a detailed discussion of this example, and of the consequences of choosing a different sequence $\Lambda_i(\phi)$, see Berger and Bernardo (1989).
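The change of variables from $(\phi, \lambda)$ back to $(\alpha, \beta)$ in Example 5.19 can be checked with a few lines of code. This is a minimal sketch: the function names are illustrative, and the Jacobian $|\partial(\phi, \lambda)/\partial(\alpha, \beta)| = 2\alpha/\beta$ is computed by hand from $\phi = \alpha\beta$, $\lambda = \alpha/\beta$.

```python
from math import sqrt


def prior_phi_lambda(phi, lam, n, m):
    """Reference prior kernel in the ordered parametrisation (phi, lambda)."""
    return sqrt(phi) / lam * sqrt(lam / m + 1.0 / (n * lam))


def prior_alpha_beta(alpha, beta, n, m):
    """The same prior expressed in (alpha, beta), including the Jacobian factor."""
    phi, lam = alpha * beta, alpha / beta
    jacobian = 2.0 * alpha / beta  # |d(phi, lambda) / d(alpha, beta)|
    return prior_phi_lambda(phi, lam, n, m) * jacobian


def ratio_to_claimed_form(alpha, beta, n, m):
    """Ratio of the transformed prior to (n alpha^2 + m beta^2)^(1/2)."""
    return prior_alpha_beta(alpha, beta, n, m) / sqrt(n * alpha**2 + m * beta**2)


n, m = 7, 12  # illustrative sample sizes
for alpha, beta in [(0.5, 2.0), (3.0, 1.0), (10.0, 0.1)]:
    print(ratio_to_claimed_form(alpha, beta, n, m))  # the same constant at every point
```

A constant ratio across arbitrary points confirms that the two expressions define the same prior up to proportionality.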


We note that the preceding example serves to illustrate the fact that reference 
priors may depend explicitly on the sample sizes defined by the experiment. There 
is, of course, nothing paradoxical in this, since the underlying notion of a reference 
analysis is a “minimally informative” prior relative to the actual experiment to be 
performed. 


5.4.5 Multiparameter Problems


The approach to the nuisance parameter case considered above was based on the use of an ordered parametrisation whose first and second components were $(\phi, \lambda)$, referred to, respectively, as the parameter of interest and the nuisance parameter. The reference prior for the ordered parametrisation $(\phi, \lambda)$ was then constructed by conditioning to give the form $\pi(\lambda \mid \phi)\,\pi(\phi)$.

When the model parameter vector $\theta$ has more than two components, this successive conditioning idea can obviously be extended by considering $\theta$ as an ordered parametrisation, $(\theta_1, \ldots, \theta_m)$, say, and generating, by successive conditioning, a reference prior, relative to this ordered parametrisation, of the form

\[ \pi(\theta) = \pi(\theta_m \mid \theta_1, \ldots, \theta_{m-1}) \cdots \pi(\theta_2 \mid \theta_1)\,\pi(\theta_1). \]


In order to describe the algorithm for producing this successively conditioned form, in the standard, regular case we shall first need to introduce some notation.

Assuming the parametric model $p(x \mid \theta)$, $\theta \in \Theta$, to be such that the Fisher information matrix

\[ H(\theta) = -E_{x \mid \theta}\left\{ \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log p(x \mid \theta) \right\} \]

has full rank, we define $S(\theta) = H^{-1}(\theta)$, define the component vectors

\[ \theta^{[j]} = (\theta_1, \ldots, \theta_j), \qquad \theta_{[j]} = (\theta_{j+1}, \ldots, \theta_m), \]

and denote by $S_j(\theta)$ the corresponding upper left $j \times j$ submatrix of $S(\theta)$, and by $h_j(\theta)$ the lower right element of $S_j^{-1}(\theta)$.

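Finally, we assume that $\Theta = \Theta_1 \times \cdots \times \Theta_m$, with $\theta_i \in \Theta_i$, and, for $i = 1, 2, \ldots, m$,
we denote by $\{\Theta_i^l\}$, $l = 1, 2, \ldots$, an increasing sequence of compact subsets of $\Theta_i$,
and define $\Theta_{[j]}^l = \Theta_{j+1}^l \times \cdots \times \Theta_m^l$.

Proposition 5.30. (Ordered reference priors under asymptotic normality).
With the above notation, and under regularity conditions extending those of
Proposition 5.29 in an obvious way, the reference prior $\pi(\theta)$, relative to the
ordered parametrisation $(\theta_1, \ldots, \theta_m)$, is given by

$$\pi(\theta) = \lim_{l \to \infty} \frac{\pi^l(\theta)}{\pi^l(\theta^*)}, \quad \text{for some } \theta^* \in \Theta,$$

where $\pi^l(\theta)$ is defined by the following recursion:

(i) For $j = m$, and $\theta_m \in \Theta_m^l$,

$$\pi_m^l(\theta_{[m-1]} \mid \theta^{[m-1]}) = \pi_m^l(\theta_m \mid \theta_1, \ldots, \theta_{m-1})
  = \frac{\{h_m(\theta)\}^{1/2}}{\int_{\Theta_m^l} \{h_m(\theta)\}^{1/2}\, d\theta_m}.$$

(ii) For $j = m-1, m-2, \ldots, 2$, and $\theta_j \in \Theta_j^l$,

$$\pi_j^l(\theta_{[j-1]} \mid \theta^{[j-1]}) = \pi_{j+1}^l(\theta_{[j]} \mid \theta^{[j]})\,
  \frac{\exp\{E_j^l[\log\{h_j(\theta)\}^{1/2}]\}}{\int_{\Theta_j^l} \exp\{E_j^l[\log\{h_j(\theta)\}^{1/2}]\}\, d\theta_j},$$

where

$$E_j^l[\log\{h_j(\theta)\}^{1/2}] = \int_{\Theta_{[j]}^l} \log\{h_j(\theta)\}^{1/2}\, \pi_{j+1}^l(\theta_{[j]} \mid \theta^{[j]})\, d\theta_{[j]}.$$

(iii) For $j = 1$, $\theta_{[0]} = \theta$, with $\theta^{[0]}$ vacuous, and

$$\pi^l(\theta) = \pi_1^l(\theta_{[0]} \mid \theta^{[0]}).$$


Proof. This follows closely the development given in Proposition 5.29. For
details see Berger and Bernardo (1992a, 1992b, 1992c).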


The derivation of the ordered reference prior is greatly simplified if the $\{h_j(\theta)\}$
terms in the above depend only on $\theta^{[j]}$; even greater simplification obtains if $H(\theta)$
is block diagonal, particularly if, for $j = 1, \ldots, m$, the $j$th term can be factored
into a product of a function of $\theta_j$ and a function not depending on $\theta_j$.


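Corollary. If $h_j(\theta)$ depends only on $\theta^{[j]}$, $j = 1, \ldots, m$, then

$$\pi^l(\theta) = \prod_{j=1}^{m} \frac{\{h_j(\theta)\}^{1/2}}{\int_{\Theta_j^l} \{h_j(\theta)\}^{1/2}\, d\theta_j}, \quad \theta \in \Theta^l.$$

If $H(\theta)$ is block diagonal (i.e., $\theta_1, \ldots, \theta_m$ are mutually orthogonal), with

$$H(\theta) = \begin{pmatrix}
h_{11}(\theta) & 0 & \cdots & 0 \\
0 & h_{22}(\theta) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & h_{mm}(\theta)
\end{pmatrix},$$

then $h_j(\theta) = h_{jj}(\theta)$, $j = 1, \ldots, m$. Furthermore, if, in this latter case,

$$\{h_{jj}(\theta)\}^{1/2} = f_j(\theta_j)\, g_j(\theta),$$

where $g_j(\theta)$ does not depend on $\theta_j$, and if the $\Theta_j^l$'s do not depend on $\theta$, then

$$\pi(\theta) \propto \prod_{j=1}^{m} f_j(\theta_j).$$


Proof. The results follow from the recursion of Proposition 5.29.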


The question obviously arises as to the appropriate ordering to be adopted in
any specific problem. At present, no formal theory exists to guide such a choice,
but experience with a wide range of examples suggests that, at least for non-hierarchical
models (hierarchical models, in which the parameters may have special forms of
interrelationship, are discussed in Section 4.6.5), the best procedure is to order
the components of $\theta$ on the basis of their inferential interest.


Example 5.20. (Reference analysis for m normal means). Let $e_n$ be an experiment
which consists in obtaining $\{x_1, \ldots, x_n\}$, $n \geq 2$, a random sample from the multivariate
normal model $N_m(x \mid \mu, \tau I_m)$, $m \geq 1$, for which the Fisher information matrix is easily
seen to be

$$H(\mu, \tau) = \begin{pmatrix} \tau I_m & 0 \\ 0 & m/(2\tau^2) \end{pmatrix}.$$

It follows from Proposition 5.30 that the reference prior relative to the natural parametrisation
$(\mu_1, \ldots, \mu_m, \tau)$ is given by

$$\pi(\mu_1, \ldots, \mu_m, \tau) \propto \tau^{-1}.$$

Clearly, in this example the result does not, in fact, depend on the order in which the
parametrisation is taken, since the parameters are all mutually orthogonal.

The reference prior $\pi(\mu_1, \ldots, \mu_m, \tau) \propto \tau^{-1}$, or $\pi(\mu_1, \ldots, \mu_m, \sigma) \propto \sigma^{-1}$ if we
parametrise in terms of $\sigma = \tau^{-1/2}$, is thus the appropriate reference form if we are interested in
any of the individual parameters. The reference posterior for any $\mu_j$ is easily shown to be
the Student density

$$\pi(\mu_j \mid x_1, \ldots, x_n) = \mathrm{St}\left(\mu_j \mid \bar{x}_j,\ (n-1)s^{-2},\ m(n-1)\right),$$

$$n\bar{x}_j = \sum_{i=1}^{n} x_{ij}, \qquad nm\,s^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} (x_{ij} - \bar{x}_j)^2,$$

which agrees with the standard argument according to which one degree of freedom should
be lost by each of the unknown means.
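Example 5.21. (Multinomial model). Let $x = \{r_1, \ldots, r_m\}$ be an observation from
a multinomial distribution (see Section 3.2), so that

$$p(r_1, \ldots, r_m \mid \theta_1, \ldots, \theta_m) = \frac{n!}{r_1! \cdots r_m!\,(n - \sum r_i)!}\;
  \theta_1^{r_1} \cdots \theta_m^{r_m} \left(1 - \sum \theta_i\right)^{n - \sum r_i},$$

from which the Fisher information matrix

$$H(\theta_1, \ldots, \theta_m) = \frac{n}{1 - \sum \theta_i}
\begin{pmatrix}
\dfrac{1 + \theta_1 - \sum\theta_i}{\theta_1} & 1 & \cdots & 1 \\
1 & \dfrac{1 + \theta_2 - \sum\theta_i}{\theta_2} & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & \dfrac{1 + \theta_m - \sum\theta_i}{\theta_m}
\end{pmatrix}$$

is easily derived, with

$$|H| = n^m \left(1 - \sum_{i=1}^{m} \theta_i\right)^{-1} \prod_{i=1}^{m} \theta_i^{-1}.$$

In this case, the conditional reference priors derived using Proposition 5.28 turn out to
be proper, and there is no need to consider subset sequences $\{\Theta_i^l\}$. In fact, noting that
$H^{-1}(\theta_1, \ldots, \theta_m)$ is given by

$$\frac{1}{n}
\begin{pmatrix}
\theta_1(1 - \theta_1) & -\theta_1\theta_2 & \cdots & -\theta_1\theta_m \\
-\theta_1\theta_2 & \theta_2(1 - \theta_2) & \cdots & -\theta_2\theta_m \\
\vdots & \vdots & \ddots & \vdots \\
-\theta_1\theta_m & -\theta_2\theta_m & \cdots & \theta_m(1 - \theta_m)
\end{pmatrix},$$

we see that the conditional asymptotic precisions used in Proposition 5.29 are easily identified,
and hence that

$$\pi(\theta_j \mid \theta_1, \ldots, \theta_{j-1}) \propto \left(\frac{1}{\theta_j}\right)^{1/2}
  \left(\frac{1}{1 - \sum_{i=1}^{j} \theta_i}\right)^{1/2}, \qquad \theta_j \leq 1 - \sum_{i=1}^{j-1} \theta_i.$$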

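The required reference prior relative to the ordered parametrisation $(\theta_1, \ldots, \theta_m)$, say, is then
given by

$$\pi(\theta_1, \ldots, \theta_m) \propto \pi(\theta_1)\,\pi(\theta_2 \mid \theta_1) \cdots \pi(\theta_m \mid \theta_1, \ldots, \theta_{m-1})$$
$$\propto \theta_1^{-1/2}(1 - \theta_1)^{-1/2}\, \theta_2^{-1/2}(1 - \theta_1 - \theta_2)^{-1/2} \cdots \theta_m^{-1/2}(1 - \theta_1 - \cdots - \theta_m)^{-1/2}.$$

The corresponding reference posterior for $\theta_1$, obtained by integrating the product of the
likelihood and this prior over $\theta_2, \ldots, \theta_m$, is proportional to

$$\int \theta_1^{r_1 - 1/2} \cdots \theta_m^{r_m - 1/2}\,
  (1 - \theta_1)^{-1/2}(1 - \theta_1 - \theta_2)^{-1/2} \cdots (1 - \theta_1 - \cdots - \theta_m)^{n - \sum r_i - 1/2}\, d\theta_2 \cdots d\theta_m.$$

After some algebra, this implies that

$$\pi(\theta_1 \mid r_1, \ldots, r_m) = \mathrm{Be}\left(\theta_1 \mid r_1 + \tfrac{1}{2},\ n - r_1 + \tfrac{1}{2}\right),$$

which, as one could expect, coincides with the reference posterior which would have been
obtained had we initially collapsed the multinomial analysis to a binomial model and then
carried out a reference analysis for the latter. Clearly, by symmetry considerations, the
above analysis applies to any $\theta_i$, $i = 1, \ldots, m$, after appropriate changes in labelling and
it is independent of the particular order in which the parameters are taken. For a detailed
discussion of this example see Berger and Bernardo (1992a). Further comments on ordering
of parameters are given in Section 5.6.2.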
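The "some algebra" behind the beta form can be sketched as a repeated beta integral. The shorthand $s$, $a$ and $B(\cdot,\cdot)$ (the beta function) is introduced here only for illustration and is not part of the original notation.

```latex
\begin{align*}
&\text{Write } s = \theta_1 + \cdots + \theta_{m-1}, \qquad a = n - \textstyle\sum_{i=1}^{m} r_i. \\
&\int_0^{1-s} \theta_m^{\,r_m - 1/2}\,(1 - s - \theta_m)^{a - 1/2}\, d\theta_m
   = (1-s)^{\,r_m + a}\, B\!\left(r_m + \tfrac12,\ a + \tfrac12\right)
   \qquad (\theta_m = (1-s)\,t). \\
&\text{Multiplying by the remaining factor } (1-s)^{-1/2} \text{ gives } (1-s)^{(a + r_m) - 1/2}, \\
&\text{which has the same form with } m-1 \text{ parameters and } a \text{ replaced by } a + r_m. \\
&\text{Repeating for } \theta_{m-1}, \ldots, \theta_2 \text{ leaves }
   \theta_1^{\,r_1 - 1/2}\,(1-\theta_1)^{\,n - r_1 - 1/2},
   \text{ the kernel of } \mathrm{Be}\!\left(\theta_1 \mid r_1 + \tfrac12,\ n - r_1 + \tfrac12\right).
\end{align*}
```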


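Example 5.22. (Normal correlation coefficient). Let $\{x_1, \ldots, x_n\}$ be a random sample
from a bivariate normal distribution, $N_2(x \mid \mu, \tau)$, where

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad
\tau^{-1} = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$

Suppose that the correlation coefficient $\rho$ is the parameter of interest, and consider the ordered
parametrisation $\{\rho, \mu_1, \mu_2, \sigma_1, \sigma_2\}$. It is easily seen that

$$H(\rho, \mu_1, \mu_2, \sigma_1, \sigma_2) = (1 - \rho^2)^{-1}
\begin{pmatrix}
\dfrac{1 + \rho^2}{1 - \rho^2} & 0 & 0 & \dfrac{-\rho}{\sigma_1} & \dfrac{-\rho}{\sigma_2} \\
0 & \dfrac{1}{\sigma_1^2} & \dfrac{-\rho}{\sigma_1\sigma_2} & 0 & 0 \\
0 & \dfrac{-\rho}{\sigma_1\sigma_2} & \dfrac{1}{\sigma_2^2} & 0 & 0 \\
\dfrac{-\rho}{\sigma_1} & 0 & 0 & \dfrac{2 - \rho^2}{\sigma_1^2} & \dfrac{-\rho^2}{\sigma_1\sigma_2} \\
\dfrac{-\rho}{\sigma_2} & 0 & 0 & \dfrac{-\rho^2}{\sigma_1\sigma_2} & \dfrac{2 - \rho^2}{\sigma_2^2}
\end{pmatrix},$$

so that

$$H^{-1} = \begin{pmatrix}
(1 - \rho^2)^2 & 0 & 0 & \tfrac{\sigma_1}{2}\rho(1 - \rho^2) & \tfrac{\sigma_2}{2}\rho(1 - \rho^2) \\
0 & \sigma_1^2 & \rho\sigma_1\sigma_2 & 0 & 0 \\
0 & \rho\sigma_1\sigma_2 & \sigma_2^2 & 0 & 0 \\
\tfrac{\sigma_1}{2}\rho(1 - \rho^2) & 0 & 0 & \tfrac{\sigma_1^2}{2} & \rho^2\tfrac{\sigma_1\sigma_2}{2} \\
\tfrac{\sigma_2}{2}\rho(1 - \rho^2) & 0 & 0 & \rho^2\tfrac{\sigma_1\sigma_2}{2} & \tfrac{\sigma_2^2}{2}
\end{pmatrix}.$$

After some algebra it can be shown that this leads to the reference prior

$$\pi(\rho, \mu_1, \mu_2, \sigma_1, \sigma_2) \propto (1 - \rho^2)^{-1} \sigma_1^{-1} \sigma_2^{-1},$$

whatever ordering of the nuisance parameters $\mu_1, \mu_2, \sigma_1, \sigma_2$ is taken. This agrees with
Lindley's (1965, p. 219) analysis. Furthermore, as one could expect from Fisher's (1915)
original analysis, the corresponding reference posterior distribution for $\rho$,

$$\pi(\rho \mid x_1, \ldots, x_n) \propto \frac{(1 - \rho^2)^{(n-3)/2}}{(1 - \rho r)^{n - 3/2}}\;
  F\left(\tfrac{1}{2},\ \tfrac{1}{2},\ n - \tfrac{1}{2},\ \frac{1 + \rho r}{2}\right)$$

(where $F$ is the hypergeometric function), only depends on the data through the sample
correlation coefficient $r$, whose sampling distribution only depends on $\rho$. For a detailed analysis
of this example, see Bayarri (1981); further discussion will be provided in Section 5.6.2.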


See, also, Hills (1987), Ye and Berger (1991) and Berger and Bernardo (1992b) 
for derivations of the reference distributions for a variety of other interesting models. 


Infinite discrete parameter spaces 

The infinite discrete case presents special problems, due to the non-existence of an 
asymptotic theory comparable to that of the continuous case. It is, however, often
possible to obtain an approximate reference posterior by embedding the discrete 
parameter space within a continuous one. 


Example 5.23. (Infinite discrete case). In the context of capture-recapture problems,
suppose it is of interest to make inferences about an integer $\theta \in \{1, 2, \ldots\}$ on the basis of a
random sample $z = \{x_1, \ldots, x_n\}$ from

$$p(x \mid \theta) = \frac{\theta(\theta + 1)}{(x + \theta)^2}, \qquad 0 \leq x \leq 1.$$

For several plausible "diffuse looking" prior distributions for $\theta$ one finds that the corresponding
posterior virtually ignores the data. Intuitively, this has to be interpreted as suggesting
that such priors actually contain a large amount of information about $\theta$ compared with that
provided by the data. A more careful approach to providing a "non-informative" prior is
clearly required. One possibility would be to embed the discrete space $\{1, 2, \ldots\}$ in the
continuous space $]0, \infty[$ since, for each $\theta > 0$, $p(x \mid \theta)$ is still a probability density for $x$.
Then, using Proposition 5.24, the appropriate reference prior is

$$\pi(\theta) \propto h(\theta)^{1/2} \propto (\theta + 1)^{-1}\theta^{-1},$$

and it is easily verified that this prior leads to a posterior in which the data are no longer
overwhelmed. If the physical conditions of the problem require the use of discrete $\theta$ values,
one could always use, for example,

$$p(\theta = 1 \mid z) = \int_0^{3/2} \pi(\theta \mid z)\, d\theta, \qquad
  p(\theta = j \mid z) = \int_{j - 1/2}^{j + 1/2} \pi(\theta \mid z)\, d\theta, \quad j > 1,$$

as an approximate discrete reference posterior.
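The discretisation in Example 5.23 can be sketched numerically. The code below is only an illustration under stated assumptions: the function names, the midpoint integration rule, the truncation point `j_max` and the data values are all hypothetical choices made here, not part of the original example. It uses the kernel $\pi(\theta)\prod_i p(x_i \mid \theta) \propto \theta^{n-1}(\theta+1)^{n-1} / \prod_i (x_i + \theta)^2$ and handles the slowly decaying tail with the substitution $\theta = 1/u$.

```python
import math

def log_kernel(theta, data):
    """log of pi(theta) * prod_i p(x_i | theta), up to an additive constant."""
    n = len(data)
    value = (n - 1) * (math.log(theta) + math.log(theta + 1.0))
    return value - 2.0 * sum(math.log(x + theta) for x in data)

def midpoint_rule(func, lower, upper, steps):
    """Composite midpoint rule; never evaluates func at the end points."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))

def discrete_reference_posterior(data, j_max=50, steps=1000):
    """Return ([p(theta=1|z), ..., p(theta=j_max|z)], posterior mass beyond j_max)."""
    kernel = lambda theta: math.exp(log_kernel(theta, data))
    # Cell for theta = 1 is (0, 3/2); cell for theta = j > 1 is (j - 1/2, j + 1/2).
    cells = [midpoint_rule(kernel, 0.0, 1.5, steps)]
    for j in range(2, j_max + 1):
        cells.append(midpoint_rule(kernel, j - 0.5, j + 0.5, steps))
    # Tail beyond j_max + 1/2, mapped to a finite interval by theta = 1/u.
    tail = midpoint_rule(lambda u: kernel(1.0 / u) / (u * u),
                         0.0, 1.0 / (j_max + 0.5), steps)
    total = sum(cells) + tail
    return [c / total for c in cells], tail / total

if __name__ == "__main__":
    probs, beyond = discrete_reference_posterior([0.12, 0.47, 0.81, 0.33, 0.65])
    print("first five cell probabilities:", probs[:5])
    print("mass beyond j_max:", beyond)
```

Because the posterior decays only like $\theta^{-2}$, the mass beyond `j_max` is reported separately rather than discarded. A sharply peaked kernel near zero would call for a finer grid than the crude rule used here.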
Prediction and Hierarchical Models


Two classes of problems that are not covered by the methods so far discussed are
hierarchical models and prediction problems. The difficulty with these problems
is that there are unknowns (typically the unknowns of interest) that have specified
distributions. For instance, if one wants to predict $y$ based on $x$ when $(y, x)$ has
density $p(y, x \mid \theta)$, the unknown of interest is $y$, but its distribution is conditionally
specified. One needs a reference prior for $\theta$, not $y$. Likewise, in a hierarchical
model with, say, $\mu_1, \mu_2, \ldots, \mu_p$ being $N(\mu_i \mid \mu_0, \lambda)$, the $\mu_i$'s may be the parameters
of interest but a prior is only needed for the hyperparameters $\mu_0$ and $\lambda$.

The obvious way to approach such problems is to integrate out the variables 
with conditionally known distributions ($y$ in the predictive problem and the $\{\mu_i\}$ in
the hierarchical model), and find the reference prior for the remaining parameters 
based on this marginal model. The difficulty that arises is how to then identify 
parameters of interest and nuisance parameters to construct the ordering necessary 
for applying the reference prior method, the real parameters of interest having been 
integrated out. 

In future work, we propose to deal with this difficulty by defining the parameter 
of interest in the reduced model to be the conditional mean of the original parameter 
of interest. Thus, in the prediction problem, $E[y \mid \theta]$ (which will be either $\theta$ or some
transformation thereof) will be the parameter of interest, and in the hierarchical
model $E[\mu_i \mid \mu_0, \lambda] = \mu_0$ will be defined to be the parameter of interest. This
technique has so far worked well in the examples to which it has been applied, but 
further study is clearly needed. 


5.5 NUMERICAL APPROXIMATIONS 


Section 5.3 considered forms of approximation appropriate as the sample size be- 
comes large relative to the amount of information contained in the prior distribution. 
Section 5.4 considered the problem of approximating a prior specification maximis- 
ing the expected information to be obtained from the data. In this section, we shall 
consider numerical techniques for implementing Bayesian methods for arbitrary 
forms of likelihood and prior specification, and arbitrary sample size. 

We note that the technical problem of evaluating quantities required for Bayesian
inference summaries typically reduces to the calculation of a ratio of two integrals.
Specifically, given a likelihood $p(x \mid \theta)$ and a prior density $p(\theta)$, the starting
point for all subsequent inference summaries is the joint posterior density for $\theta$
given by

$$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{\int p(x \mid \theta)\,p(\theta)\, d\theta}.$$

From this, we may be interested in obtaining univariate marginal posterior densities
for the components of $\theta$, bivariate joint marginal posterior densities for pairs of
components of $\theta$, and so on. Alternatively, we may be interested in marginal
posterior densities for functions of components of $\theta$ such as ratios or products.

In all these cases, the technical key to the implementation of the formal solution
given by Bayes' theorem, for specified likelihood and prior, is the ability to perform
a number of integrations. First, we need to evaluate the denominator in Bayes'
theorem in order to obtain the normalising constant of the posterior density; then
we need to integrate over complementary components of $\theta$, or transformations
of $\theta$, in order to obtain marginal (univariate or bivariate) densities, together with
summary moments, highest posterior density intervals and regions, or whatever.
Except in certain rather stylised problems (e.g., exponential families together with
conjugate priors), the required integrations will not be feasible analytically and,
thus, efficient approximation strategies will be required.

In this section, we shall outline five possible numerical approximation strate- 
gies, which will be discussed under the subheadings: Laplace Approximation; 
Iterative Quadrature; Importance Sampling; Sampling-importance-resampling; 
Markov Chain Monte Carlo. An exhaustive account of these and other methods 
will be given in the second volume in this series, Bayesian Computation. 


5.5.1 Laplace Approximation 


We motivate the approximation by noting that the technical problem of evaluating
quantities required for Bayesian inference summaries is typically that of evaluating
an integral of the form

$$E[g(\theta) \mid x] = \int g(\theta)\, p(\theta \mid x)\, d\theta,$$

where $p(\theta \mid x)$ is derived from a predictive model with an appropriate representation
as a mixture of parametric models, and $g(\theta)$ is some real-valued function of interest.
Often, $g(\theta)$ is a first or second moment, and since $p(\theta \mid x)$ is given by

$$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{\int p(x \mid \theta)\,p(\theta)\, d\theta},$$

we see that $E[g(\theta) \mid x]$ has the form of a ratio of two integrals.

Focusing initially on this situation of a required inference summary for $g(\theta)$,
and assuming $g(\theta)$ almost everywhere positive, we note that the posterior expectation
of interest can be written in the form

$$E[g(\theta) \mid x] = \frac{\int \exp\{-nh^*(\theta)\}\, d\theta}{\int \exp\{-nh(\theta)\}\, d\theta},$$

where, with the vector $x = (x_1, \ldots, x_n)$ of observations fixed, the functions $h(\theta)$
and $h^*(\theta)$ are defined by

$$-nh(\theta) = \log p(\theta) + \log p(x \mid \theta),$$
$$-nh^*(\theta) = \log g(\theta) + \log p(\theta) + \log p(x \mid \theta).$$

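Let us consider first the case of a single unknown parameter, $\theta \in \Re$, and define
$\hat\theta$, $\theta^*$ and $\hat\sigma$, $\sigma^*$ such that

$$-h(\hat\theta) = \sup_\theta\{-h(\theta)\}, \qquad \hat\sigma = \left[h''(\theta)\right]^{-1/2}\Big|_{\theta = \hat\theta},$$

$$-h^*(\theta^*) = \sup_\theta\{-h^*(\theta)\}, \qquad \sigma^* = \left[h^{*\prime\prime}(\theta)\right]^{-1/2}\Big|_{\theta = \theta^*}.$$

Assuming $h(\cdot)$, $h^*(\cdot)$ to be suitably smooth functions, the Laplace approximations
for the two integrals defining the numerator and denominator of $E[g(\theta) \mid x]$ are
given (see, for example, Jeffreys and Jeffreys, 1946) by

$$\sqrt{2\pi}\,\sigma^* n^{-1/2} \exp\{-nh^*(\theta^*)\}$$

and

$$\sqrt{2\pi}\,\hat\sigma\, n^{-1/2} \exp\{-nh(\hat\theta)\}.$$

Essentially, the approximations consist of retaining quadratic terms in Taylor expansions
of $h(\cdot)$ and $h^*(\cdot)$, and are thus equivalent to normal-like approximations
to the integrands. In the context we are considering, it then follows immediately
that the resulting approximation for $E[g(\theta) \mid x]$ has the form

$$\hat{E}[g(\theta) \mid x] = \left(\frac{\sigma^*}{\hat\sigma}\right) \exp\left\{-n\left[h^*(\theta^*) - h(\hat\theta)\right]\right\},$$

and Tierney and Kadane (1986) have shown that

$$E[g(\theta) \mid x] = \hat{E}[g(\theta) \mid x]\left(1 + O(n^{-2})\right).$$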


The Laplace approximation approach, exploiting the fact that Bayesian inference 
summaries typically involve ratios of integrals, is thus seen to provide a potentially 
very powerful general approximation technique. See, also, Tierney, Kass and 
Kadane (1987, 1989a, 1989b), Kass, Tierney and Kadane (1988, 1989a, 1989b, 
1991) and Wong and Li (1992) for further underpinning of, and extensions to, this 
methodology. 

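Considering now the general case of $\theta \in \Re^k$, the Laplace approximation to
the denominator of $E[g(\theta) \mid x]$ is given by

$$\int \exp\{-nh(\theta)\}\, d\theta \approx (2\pi)^{k/2}\, \left|n\nabla^2 h(\hat\theta)\right|^{-1/2} \exp\{-nh(\hat\theta)\},$$

where $\hat\theta$ is defined by

$$-h(\hat\theta) = \sup_\theta\{-h(\theta)\}$$

and

$$\left[\nabla^2 h(\hat\theta)\right]_{ij} = \frac{\partial^2 h(\theta)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta = \hat\theta},$$

the Hessian matrix of $h$ evaluated at $\hat\theta$, with an exactly analogous expression for
the numerator, defined in terms of $h^*(\cdot)$ and $\theta^*$. Writing

$$\hat\sigma = \left|n\nabla^2 h(\hat\theta)\right|^{-1/2}, \qquad \sigma^* = \left|n\nabla^2 h^*(\theta^*)\right|^{-1/2},$$

the Laplace approximation to $E[g(\theta) \mid x]$ is given by

$$\hat{E}[g(\theta) \mid x] = \left(\frac{\sigma^*}{\hat\sigma}\right) \exp\left\{-n\left[h^*(\theta^*) - h(\hat\theta)\right]\right\},$$

completely analogous to the univariate case.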

If $\theta = (\phi, \lambda)$ and the required inference summary is the marginal posterior
density for $\phi$, application of the Laplace approximation approach corresponds to
obtaining $p(\phi \mid x)$ pointwise by fixing $\phi$ in the numerator and defining $g(\lambda) = 1$.
It is easily seen that this leads to

$$\hat{p}(\phi \mid x) \propto \int \exp\{-nh_\phi(\lambda)\}\, d\lambda
  \propto \left|n\nabla^2 h_\phi(\hat\lambda_\phi)\right|^{-1/2} \exp\{-nh_\phi(\hat\lambda_\phi)\},$$

where

$$-nh_\phi(\lambda) = \log p(\phi, \lambda) + \log p(x \mid \phi, \lambda),$$

considered as a function of $\lambda$ for fixed $\phi$, and

$$-h_\phi(\hat\lambda_\phi) = \sup_\lambda\{-h_\phi(\lambda)\}.$$

The form $\hat{p}(\phi \mid x)$ thus provides (up to proportionality) a pointwise approximation
to the ordinates of the marginal posterior density for $\phi$. Considering this form in
more detail, we see that, if $p(\phi, \lambda)$ is constant,

$$\hat{p}(\phi \mid x) \propto \left|-\nabla^2 \log p(x \mid \phi, \hat\lambda_\phi)\right|^{-1/2} p(x \mid \phi, \hat\lambda_\phi).$$

The form $\nabla^2 \log p(x \mid \phi, \hat\lambda_\phi)$ is the Hessian of the log-likelihood function, considered
as a function of $\lambda$ for fixed $\phi$, and evaluated at the value $\hat\lambda_\phi$ which maximises
the log-likelihood over $\lambda$ for fixed $\phi$; the form $p(x \mid \phi, \hat\lambda_\phi)$ is usually called the
profile likelihood for $\phi$, corresponding to the parametric model $p(x \mid \phi, \lambda)$. The
approximation to the marginal density for $\phi$ given by $\hat{p}(\phi \mid x)$ has a form often
referred to as the modified profile likelihood (see, for example, Cox and Reid, 1987,
for a convenient discussion of this terminology). Approximation to Bayesian inference
summaries through Laplace approximation is therefore seen to have links
with forms of inference summary proposed and derived from a non-Bayesian perspective.
For further references, see Appendix B, Section 4.2.

In relation to the above analysis, we note that the Laplace approximation is
essentially derived by considering normal approximations to the integrands appearing
in the numerator and denominator of the general form $E[g(\theta) \mid x]$. If the
forms concerned are not well approximated by second-order Taylor expansions of
the exponent terms of the integrands, which may be the case with small or moderate
samples, particularly when components of $\theta$ are constrained to ranges other
than the real line, we may be able to improve substantially on this direct Laplace
approximation approach.

One possible alternative, at least if $\theta$ is a scalar parameter, is to attempt
to approximate the integrands by forms other than normal, perhaps resembling
more the actual posterior shapes, such as gammas or betas. Such an approach has
been followed in the one-parameter case by Morris (1988), who develops a general
approximation technique based around the Pearson family of densities. These are
characterised by parameters $m$, $\mu_0$ and a quadratic function $Q$, which specify a
density for $\theta$ of the form

$$q_Q(\theta \mid m, \mu_0) = K_Q(m, \mu_0)\, \frac{p(\theta)}{Q(\theta)},$$

where

$$p(\theta) = \exp\left\{-m \int \frac{\theta - \mu_0}{Q(\theta)}\, d\theta\right\},$$

$$K_Q^{-1}(m, \mu_0) = \int \frac{p(\theta)}{Q(\theta)}\, d\theta,$$

$$Q(\theta) = q_0 + q_1\theta + q_2\theta^2,$$

and the range of $\theta$ is such that $0 < Q(\theta) < \infty$.
It is shown by Morris (1988) that, for a given choice of quadratic function
$Q$, an analogue to the Laplace-type approximation of an integral of a unimodal
function $f(\theta)$ is given by

$$\int f(\theta)\, d\theta \approx \frac{f(\hat\theta)}{q_Q(\hat\theta \mid \hat{m}, \hat\theta)},$$

where $\hat{m} = -r''(\hat\theta)\,Q(\hat\theta)$ and $\hat\theta$ maximises $r(\theta) = \log[f(\theta)Q(\theta)]$. Details of the
forms of $K_Q^{-1}$, $Q$ and $p$ for familiar forms of Pearson densities are given in Morris
(1988), where it is also shown that the approximation can often be further simplified
to the expression

$$\int f(\theta)\, d\theta \approx \frac{\sqrt{2\pi}\, f(\hat\theta)}{\left[-r''(\hat\theta)\right]^{1/2}}.$$

A second alternative is to note that the version of the Laplace approximation
proposed by Tierney and Kadane (1986) is not invariant to changes in the (arbitrary)
parametrisation chosen when specifying the likelihood and prior density
functions. It may be, therefore, that by judicious reparametrisation (of the likelihood,
together with the appropriate, Jacobian adjusted, prior density) the Laplace
approximation can itself be made more accurate, even in contexts where the original
parametrisation does not suggest the plausibility of a normal-type approximation
to the integrands. We note, incidentally, that such a strategy is also available in
multiparameter contexts, whereas the Pearson family approach does not seem so
readily generalisable.

To provide a concrete illustration of these alternative analytic approximation 
approaches consider the following. 


Example 5.24. (Approximating the mean of a beta distribution).

Suppose that a posterior beta distribution, $\mathrm{Be}(\theta \mid r_n + \tfrac12,\ n - r_n + \tfrac12)$, has arisen from
a $\mathrm{Bi}(r_n \mid n, \theta)$ likelihood, together with a $\mathrm{Be}(\theta \mid \tfrac12, \tfrac12)$ prior (the reference prior, derived in
Example 5.14). Writing $r_n = x$, we can, in fact, identify the analytic form of the posterior
mean in this case,

$$E[\theta \mid x] = \frac{x + \tfrac12}{n + 1},$$

but we shall ignore this for the moment and examine approximations implied by the techniques
discussed above.
First, defining $g(\theta) = \theta$, we see, after some algebra, that the Tierney-Kadane form of
the Laplace approximation gives the estimated posterior mean

$$\hat{E}[\theta \mid x] = \frac{(n - 1)^{n + 1/2}\,(x + \tfrac12)^{x + 1}}{n^{n + 3/2}\,(x - \tfrac12)^{x}}.$$

If, instead, we reparametrise to $\phi = \sin^{-1}\sqrt{\theta}$, the required integrals are defined in terms of

$$g(\phi) = \sin^2\phi, \qquad p(x \mid \phi) \propto (\sin^2\phi)^x (1 - \sin^2\phi)^{n - x}, \qquad p(\phi) \propto 1,$$

and the Laplace approximation can be shown to be

$$\tilde{E}[\theta \mid x] = \frac{n^{n + 1/2}\,(x + 1)^{x + 1}}{(n + 1)^{n + 3/2}\,x^{x}}.$$

Alternatively, if we work via the Pearson family, with $Q(\theta) = \theta(1 - \theta)$ as the "natural"
choice for a beta-like posterior, we obtain

$$E^*[\theta \mid x] = \frac{(n + 1)^{n + 1/2}\,(x + \tfrac32)^{x + 1}}{(n + 2)^{n + 3/2}\,(x + \tfrac12)^{x}}.$$

By considering the percentage errors of estimation, defined by

$$100 \times \left|\frac{\text{true} - \text{estimated}}{\text{true}}\right|,$$

we can study the performance of the three estimates for various values of $n$ and $x$. Details
are given in Achcar and Smith (1989); here, we simply summarise, in Table 5.1, the results
for $n = 5$, $x = 3$, which typify the performance of the estimates for small $n$.


Table 5.1 Approximation of $E[\theta \mid x]$ from $\mathrm{Be}(\theta \mid x + \tfrac12,\ n - x + \tfrac12)$
(percentage errors in parentheses)


True value          Laplace approximations                      Pearson approximation
$E[\theta \mid x]$    $\hat{E}[\theta \mid x]$    $\tilde{E}[\theta \mid x]$    $E^*[\theta \mid x]$
0.583               0.563                0.580                  0.585
                    (3.6%)               (0.6%)                 (0.3%)

We see from Table 5.1 that the Pearson approximation, which is, in some sense, preselected to 
be best, does, in fact, outperform the others. However, it is striking that the performance of the 
Laplace approximation under reparametrisation leads to such a considerable improvement 
over that based on the original parametrisation, and is a very satisfactory alternative to the 
“optimal” Pearson form. Further examples are given in Achcar and Smith (1989). 
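The entries of Table 5.1 can be reproduced with a short numerical sketch. The helper names below are hypothetical, the curvatures are obtained by finite differences rather than analytically, and the modes are the closed-form maximisers of the relevant log-integrands; the Pearson entry simply evaluates the closed-form expression quoted in the example. This is an illustration under those assumptions, not the computation used by Achcar and Smith (1989).

```python
import math

def second_derivative(func, point, step=1e-4):
    """Central-difference estimate of func''(point)."""
    return (func(point + step) - 2.0 * func(point) + func(point - step)) / step ** 2

def laplace_ratio(log_num, log_den, mode_num, mode_den):
    """Tierney-Kadane estimate of  int exp(log_num) / int exp(log_den)."""
    sigma_num = (-second_derivative(log_num, mode_num)) ** -0.5
    sigma_den = (-second_derivative(log_den, mode_den)) ** -0.5
    return sigma_num / sigma_den * math.exp(log_num(mode_num) - log_den(mode_den))

def laplace_original(n, x):
    """Laplace estimate of E[theta | x] in the original parametrisation."""
    log_den = lambda t: (x - 0.5) * math.log(t) + (n - x - 0.5) * math.log(1.0 - t)
    log_num = lambda t: math.log(t) + log_den(t)          # g(theta) = theta
    return laplace_ratio(log_num, log_den, (x + 0.5) / n, (x - 0.5) / (n - 1.0))

def laplace_arcsine(n, x):
    """Laplace estimate after reparametrising to phi = arcsin(sqrt(theta))."""
    log_den = lambda p: (2.0 * x * math.log(math.sin(p))
                         + 2.0 * (n - x) * math.log(math.cos(p)))
    log_num = lambda p: 2.0 * math.log(math.sin(p)) + log_den(p)  # g = sin^2(phi)
    mode_num = math.asin(math.sqrt((x + 1.0) / (n + 1.0)))
    mode_den = math.asin(math.sqrt(x / n))
    return laplace_ratio(log_num, log_den, mode_num, mode_den)

def pearson_closed_form(n, x):
    """Closed-form Pearson-family estimate with Q(theta) = theta (1 - theta)."""
    return ((n + 1.0) ** (n + 0.5) * (x + 1.5) ** (x + 1.0)
            / ((n + 2.0) ** (n + 1.5) * (x + 0.5) ** x))

if __name__ == "__main__":
    n, x = 5, 3
    print("true     ", (x + 0.5) / (n + 1.0))
    print("Laplace  ", laplace_original(n, x))
    print("arcsine  ", laplace_arcsine(n, x))
    print("Pearson  ", pearson_closed_form(n, x))
```

The same `laplace_ratio` helper applies to any pair of smooth, unimodal log-integrands for which the two modes are available, which is the sense in which the method is general.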


In general, it would appear that, in cases involving a relatively small number of 
parameters, the Laplace approach, in combination with judicious reparametrisation, 
can provide excellent approximations to general Bayesian inference summaries, 
whether in the form of posterior moments or marginal posterior densities. However, 
in multiparameter contexts there may be numerical problems with the evaluation 
of local derivatives in cases where analytic forms are unobtainable or too tedious to 
identify explicitly. In addition, there are awkward complications if the integrands 
are multimodal. At the time of writing, this area of approximation theory is very 
much still an active research field and the full potential of this and related methods 
(see, also, Lindley, 1980b; Leonard et al., 1989) has yet to be clarified.

5.5.2 Iterative Quadrature


It is well known that univariate integrals of the type

$$\int_{-\infty}^{\infty} e^{-t^2} f(t)\, dt$$

are often well approximated by Gauss-Hermite quadrature rules of the form

$$\sum_{i=1}^{n} w_i f(t_i),$$

where $t_i$ is the $i$th zero of the Hermite polynomial $H_n(t)$. In particular, if $f(t)$ is
a polynomial of degree at most $2n - 1$, then the quadrature rule approximates the
integral without error. This implies, for example, that, if $h(t)$ is a suitably well
behaved function and

$$g(t) = h(t)\,(2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2}\left(\frac{t - \mu}{\sigma}\right)^2\right\},$$

then

$$\int_{-\infty}^{\infty} g(t)\, dt \approx \sum_{i=1}^{n} m_i\, g(z_i),$$

where

$$m_i = w_i \exp(t_i^2)\sqrt{2}\,\sigma, \qquad z_i = \mu + \sqrt{2}\,\sigma t_i$$

(see, for example, Naylor and Smith, 1982).
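The rule above can be sketched in a few lines using only the standard library. The nodes are found here by bisection between the zeros of successive orthonormal Hermite polynomials, and the weights from the derivative identity for those polynomials; this particular construction and all function names are choices made for the illustration, not the procedure of Naylor and Smith (1982).

```python
import math

def hermite_orthonormal(k, t):
    """Value at t of the degree-k Hermite polynomial, orthonormal w.r.t. exp(-t^2)."""
    p_prev, p = 0.0, math.pi ** -0.25
    for j in range(1, k + 1):
        p_prev, p = p, t * math.sqrt(2.0 / j) * p - math.sqrt((j - 1.0) / j) * p_prev
    return p

def gauss_hermite(n):
    """Nodes t_i and weights w_i for  int exp(-t^2) f(t) dt  ~  sum w_i f(t_i)."""
    bound = math.sqrt(2.0 * n + 2.0)      # every zero lies inside (-bound, bound)
    roots = []
    for k in range(1, n + 1):             # zeros of degree k interlace those of k - 1
        edges = [-bound] + roots + [bound]
        new_roots = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            f_lo = hermite_orthonormal(k, lo)
            for _ in range(200):          # plain bisection
                mid = 0.5 * (lo + hi)
                f_mid = hermite_orthonormal(k, mid)
                if (f_mid > 0) == (f_lo > 0):
                    lo, f_lo = mid, f_mid
                else:
                    hi = mid
            new_roots.append(0.5 * (lo + hi))
        roots = new_roots
    weights = [1.0 / (n * hermite_orthonormal(n - 1, t) ** 2) for t in roots]
    return roots, weights

def normal_density(t, mu, sigma):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / math.sqrt(2.0 * math.pi * sigma ** 2)

def normal_weighted_integral(g, mu, sigma, n):
    """Approximate int g(t) dt using m_i = w_i exp(t_i^2) sqrt(2) sigma and z_i."""
    nodes, weights = gauss_hermite(n)
    total = 0.0
    for t, w in zip(nodes, weights):
        m = w * math.exp(t * t) * math.sqrt(2.0) * sigma
        z = mu + math.sqrt(2.0) * sigma * t
        total += m * g(z)
    return total

if __name__ == "__main__":
    # Second moment of N(1, 2^2): here h(t) = t^2, so a 3-point rule is exact.
    print(normal_weighted_integral(lambda t: t * t * normal_density(t, 1.0, 2.0), 1.0, 2.0, 3))
```

In an iterative scheme, the values of `mu` and `sigma` passed to `normal_weighted_integral` would be replaced by the posterior mean and standard deviation estimated on the previous pass.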

It follows that Gauss-Hermite rules are likely to prove very efficient for functions
which, expressed in informal terms, closely resemble "polynomial × normal"
forms. In fact, this is a rather rich class which, even for moderate $n$ (less than 12,
say), covers many of the likelihood × prior shapes we typically encounter for
parameters defined on $(-\infty, \infty)$. Moreover, the applicability of this approximation
is vastly extended by working with suitable transformations of parameters
defined on other ranges such as $(0, \infty)$ or $(a, b)$, using, for example, $\log(t)$ or
$\log(t - a) - \log(b - t)$, respectively. Of course, to use the above form we must
specify $\mu$ and $\sigma$ in the normal component. It turns out that, given reasonable starting
values (from any convenient source: prior information, maximum likelihood
estimates, etc.), we typically can successfully iterate the quadrature rule, substituting
estimates of the posterior mean and variance obtained using previous values of
$m_i$ and $z_i$. Moreover, we note that if the posterior density is well-approximated
by the product of a normal and a polynomial of degree at most $2n - 3$, then an
$n$-point Gauss-Hermite rule will prove effective for simultaneously evaluating the
normalising constant and the first and second moments, using the same (iterated)
set of $m_i$ and $z_i$. In practice, it is efficient to begin with a small grid size ($n = 4$ or
$n = 5$) and then to gradually increase the grid size until stable answers are obtained
both within and between the last two grid sizes used.


Our discussion so far has been for the one-dimensional case. Clearly, however,
the need for an efficient strategy is most acute in higher dimensions. The
"obvious" extension of the above ideas is to use a cartesian product rule giving the
approximation

$$\int \cdots \int g(t_1, \ldots, t_k)\, dt_1 \cdots dt_k \approx
  \sum_{i_k} m_{i_k}^{(k)} \cdots \sum_{i_1} m_{i_1}^{(1)}\, g\left(z_{i_1}^{(1)}, \ldots, z_{i_k}^{(k)}\right),$$

where the grid points and the weights are found by substituting the appropriate
iterated estimates of $\mu$ and $\sigma^2$ corresponding to the marginal component $t_j$.


The problem with this “obvious” strategy is that the product form is only ef- 
ficient if we are able to make an (at least approximate) assumption of posterior 
independence among the individual components. In this case, the lattice of in- 
tegration points formed from the product of the two one-dimensional grids will 
efficiently cover the bulk of the posterior density. However, if high posterior corre- 
lations exist, these will lead to many of the lattice points falling in areas of negligible 
posterior density, thus causing the cartesian product rule to provide poor estimates 
of the normalising constant and moments. 


To overcome this problem, we could first apply individual parameter trans- 
formations of the type discussed above and then attempt to transform the resulting 
parameters, via an appropriate linear transformation, to a new, approximately ortho- 
gonal, set of parameters. At the first step, this linear transformation derives from 
an initial guess or estimate of the posterior covariance matrix (for example, based 
on the observed information matrix from a maximum likelihood analysis). Suc- 
cessive transformations are then based on the estimated covariance matrix from the 
previous iteration. 


The following general strategy has proved highly effective for problems involving
up to six parameters (see, for example, Naylor and Smith, 1982, Smith et
al., 1985, 1987, Naylor and Smith, 1988).


(1) Reparametrise individual parameters so that the resulting working parameters 
all take values on the real line. 


(2) Using initial estimates of the joint posterior mean vector and covariance ma- 
trix for the working parameters, transform further to a centred, scaled, more 
“orthogonal” set of parameters. 


(3) Using the derived initial location and scale estimates for these "orthogonal"
parameters, carry out, on suitably dimensioned grids, cartesian product integration
of functions of interest.

(4) Iterate, successively updating the mean and covariance estimates, until stable
results are obtained both within and between grids of specified dimension.


For problems involving larger numbers of parameters, say between six and 
twenty, cartesian product approaches become computationally prohibitive and al- 
ternative approaches to numerical integration are required. 

One possibility is the use of spherical quadrature rules (Stroud, 1971, Sections
2.6 and 2.7), derived by transforming from cartesian to spherical polar coordinates
and constructing optimal integration formulae based on symmetric configurations
over concentric spheres. Full details of this approach will be given in the
volume Bayesian Computation. For a brief introduction, see Smith (1991). Other
relevant references on numerical quadrature include Shaw (1988b), Flournoy and
Tsutakawa (1991), O'Hagan (1991) and Dellaportas and Wright (1992).

The efficiency of numerical quadrature methods is often very dependent on the 
particular parametrisation used. For further information on this topic, see Marriott 
(1988), Hills and Smith (1992, 1993) and Marriott and Smith (1992). For related 
discussion, see Kass and Slate (1992). 

The ideas outlined above relate to the use of numerical quadrature formulae 
to implement Bayesian statistical methods. It is amusing to note that the roles 
can be reversed and Bayesian statistical methods used to derive optimal numerical 
quadrature formulae! See, for example, Diaconis (1988b) and O'Hagan (1992). 


5.5.3 Importance Sampling 


The importance sampling approach to numerical integration is based on the observation
that, if $f$ is a function and $g$ is a probability density function,

$$\int f(x)\, dx = \int \left[\frac{f(x)}{g(x)}\right] g(x)\, dx
  = \int \left[\frac{f(x)}{g(x)}\right] dG(x)
  = E_G\left[\frac{f(x)}{g(x)}\right],$$

which suggests the "statistical" approach of generating a sample from the distribution
function $G$ (referred to in this context as the importance sampling distribution)
and using the average of the values of the ratio $f/g$ as an unbiased estimator of
$\int f(x)\, dx$. However, the variance of such an estimator clearly depends critically
on the choice of $G$, it being desirable to choose $g$ to be "similar" to the shape of $f$.
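A minimal sketch of the estimator follows. The integrand (an unnormalised normal kernel, whose integral is $\sqrt{2\pi}$), the Cauchy importance density, the sample size and the function names are all illustrative assumptions introduced here; a heavy-tailed $g$ is used so that the ratio $f/g$ stays bounded.

```python
import math
import random

def importance_sampling_estimate(f, draw_from_g, density_g, size, rng):
    """Average of f(x)/g(x) over draws x from g: an unbiased estimate of int f."""
    total = 0.0
    for _ in range(size):
        x = draw_from_g(rng)
        total += f(x) / density_g(x)
    return total / size

def cauchy_draw(rng):
    """Standard Cauchy variate by inversion of its distribution function."""
    return math.tan(math.pi * (rng.random() - 0.5))

def cauchy_density(x):
    return 1.0 / (math.pi * (1.0 + x * x))

def normal_kernel(x):
    """Unnormalised N(0, 1) density; its integral is sqrt(2 pi)."""
    return math.exp(-0.5 * x * x)

if __name__ == "__main__":
    rng = random.Random(2024)
    estimate = importance_sampling_estimate(normal_kernel, cauchy_draw,
                                            cauchy_density, 100000, rng)
    print(estimate, math.sqrt(2.0 * math.pi))
```

With these choices the ratio $f/g$ is bounded, so the estimator has finite variance; reversing the roles (a normal $g$ for a Cauchy-tailed $f$) would not share this property.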
In multiparameter Bayesian contexts, exploitation of this idea requires designing
importance sampling distributions which are efficient for the kinds of integrands
arising in typical Bayesian applications. A considerable amount of work has focused
on the use of multivariate normal or Student forms, or modifications thereof,
much of this work motivated by econometric applications. We note, in particular,
the contributions of Kloek and van Dijk (1978), van Dijk and Kloek (1983, 1985),
van Dijk et al. (1987) and Geweke (1988, 1989).

An alternative line of development (Shaw, 1988a) proceeds as follows. In the
univariate case, if we choose $g$ to be heavier-tailed than $f$, and if we work with
$y = G^{-1}(x)$, the required integral is the expected value of
$f[G^{-1}(x)]/g[G^{-1}(x)]$ with respect to a uniform distribution on the interval
$(0, 1)$. Owing to the periodic nature of the ratio function over this interval, we
are likely to get a reasonable approximation to the integral by simply taking some
equally spaced set of points on $(0, 1)$, rather than actually generating “uniformly
distributed” random numbers. If $f$ is a function of more than one argument ($k$,
say), an exactly parallel argument suggests that the choice of a suitable $g$
followed by the use of a suitably selected “uniform” configuration of points in the
$k$-dimensional unit hypercube will provide an efficient multidimensional
integration procedure.
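
The univariate version of this idea can be sketched as follows. This is only an illustration: the integrand, the Cauchy choice of $G$, the use of interval midpoints as the “equally spaced” points, and the function names are all assumptions made here, not details taken from Shaw (1988a).

```python
import math

def f(y):
    # Integrand: standard normal kernel; true integral is sqrt(2*pi).
    return math.exp(-0.5 * y * y)

def g(y):
    # Heavier-tailed importance density: standard Cauchy.
    return 1.0 / (math.pi * (1.0 + y * y))

def G_inv(x):
    # Inverse distribution function of the standard Cauchy.
    return math.tan(math.pi * (x - 0.5))

def grid_estimate(n):
    # Average f/g over the n equally spaced midpoints (i + 0.5)/n of (0, 1),
    # mapped to the real line through G_inv.
    total = 0.0
    for i in range(n):
        x = (i + 0.5) / n
        y = G_inv(x)
        total += f(y) / g(y)
    return total / n

approx = grid_estimate(200)
```

With the heavier-tailed $g$, the ratio vanishes smoothly at both ends of $(0, 1)$, which is what makes a plain equally spaced rule effective in this setting.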

However, the effectiveness of all this depends on choosing a suitable $G$, bearing
in mind that we need to have available a flexible set of possible distributional
shapes, for which $G^{-1}$ is available explicitly. In the univariate case, one such
family defined on $\Re$ is provided by considering the random variable

$$
x_a = a\,h(u) - (1 - a)\,h(1 - u),
$$

where $u$ is uniformly distributed on $(0, 1)$, $h : (0, 1) \to \Re$ is a monotone
increasing function such that

$$
\lim_{u \to 0} h(u) = -\infty,
$$

and $0 < a < 1$ is a constant. The choice $a = 0.5$ leads to symmetric distributions;
as $a \to 0$ or $a \to 1$ we obtain increasingly skew distributions (to the left or
right). The tail-behaviour of the distributions is governed by the choice of the
function $h$. Thus, for example, $h(u) = \log(u)$ leads to a family whose symmetric
member is the logistic distribution; $h(u) = -\tan[\pi(1 - u)/2]$ leads to a family
whose symmetric member is the Cauchy distribution. Moreover, the moments of the
distributions of the $x_a$ are polynomials in $a$ (of corresponding order), the
median is linear in $a$, etc., so that sample information about such quantities
provides (for any given choice of $h$) operational guidance on the appropriate
choice of $a$. To use this family in the multiparameter case, we again employ
individual parameter transformations, so that all parameters belong to $\Re$,
together with “orthogonalising” transformations, so that parameters can be treated
“independently”. In the transformed setting, it is natural to consider an iterative
importance sampling strategy which attempts to learn about an appropriate choice of
$G$ for each parameter.
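
A short sketch of this family follows; the function names are hypothetical and the code simply evaluates $x_a = a\,h(u) - (1-a)\,h(1-u)$ for the two choices of $h$ mentioned above, so that the stated properties of the symmetric members and of the median can be checked numerically.

```python
import math

def h_logistic(u):
    # h(u) = log(u): the symmetric member (a = 0.5) is a logistic distribution.
    return math.log(u)

def h_cauchy(u):
    # h(u) = -tan(pi * (1 - u) / 2): the symmetric member is the Cauchy distribution.
    return -math.tan(math.pi * (1.0 - u) / 2.0)

def x_a(u, a, h):
    # Quantile-type transformation of a uniform u on (0, 1).
    return a * h(u) - (1.0 - a) * h(1.0 - u)

u = 0.3
# a = 0.5 with h_cauchy reproduces the standard Cauchy quantile tan(pi * (u - 1/2)).
cauchy_value = x_a(u, 0.5, h_cauchy)
# a = 0.5 with h_logistic gives half the logit of u, i.e. a logistic with scale 1/2.
logistic_value = x_a(u, 0.5, h_logistic)
# The median (u = 0.5) equals (2a - 1) * h(0.5), which is linear in a.
median_skewed = x_a(0.5, 0.8, h_logistic)
```

Since the transformation is explicit in $u$, draws from any member of the family are obtained by applying `x_a` to uniform random numbers or to a fixed configuration of points in $(0, 1)$.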

As we remarked earlier, part of this strategy requires the specification of
“uniform” configurations of points in the $k$-dimensional unit hypercube. This
problem has been extensively studied by number theorists and systematic
experimentation with various suggested forms of “quasi-random” sequences has
identified effective forms of configuration for importance sampling purposes: for
details, see Shaw (1988a). The general strategy is then the following.


(1) Reparametrise individual parameters so that resulting working parameters all 
take values on the real line. 


(2) Using initial estimates of the posterior mean vector and covariance matrix for
the working parameters, transform to a centred, scaled, more “orthogonal” set
of parameters.


(3) In terms of these transformed parameters, set

$$
g(x) = \prod_{j=1}^{k} g_j(x_j),
$$

for “suitable” choices of $g_j$, $j = 1, \ldots, k$.


(4) Use the inverse distribution function transformation to reduce the problem to 
that of calculating an average over a “suitable” uniform configuration in the 
k-dimensional hypercube. 


(5) Use information from this “sample” to learn about skewness, tailweight, etc.
for each $g_j$, and hence choose “better” $g_j$, $j = 1, \ldots, k$, and revise
estimates of the mean vector and covariance matrix.


(6) Iterate until the sample variance of replicate estimates of the integral value is 
sufficiently small. 
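
Steps (3) and (4) can be sketched for $k = 2$ as follows. This is an illustration only: the Halton sequence is swapped in here as one standard “quasi-random” configuration (the configurations actually studied by Shaw are not reproduced), the $g_j$ are taken to be standard logistic densities, the integrand is a bivariate normal kernel with known integral $2\pi$, and all function names are hypothetical. The adaptive steps (5) and (6) are not implemented.

```python
import math

def van_der_corput(index, base):
    # Radical inverse of a positive integer in the given base; lies strictly in (0, 1).
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value

def logistic_inv_cdf(u):
    # Inverse distribution function of the standard logistic.
    return math.log(u / (1.0 - u))

def logistic_pdf(x):
    # Standard logistic density, written with exp(-|x|) for numerical stability.
    e = math.exp(-abs(x))
    return e / (1.0 + e) ** 2

def f(x1, x2):
    # Integrand: bivariate standard normal kernel; true integral is 2*pi.
    return math.exp(-0.5 * (x1 * x1 + x2 * x2))

def quasi_random_estimate(n, bases=(2, 3)):
    # Average f/g over n Halton points mapped through the inverse cdfs,
    # where g is the product of the univariate logistic densities.
    total = 0.0
    for i in range(1, n + 1):
        xs = [logistic_inv_cdf(van_der_corput(i, b)) for b in bases]
        g = 1.0
        for x in xs:
            g *= logistic_pdf(x)
        total += f(*xs) / g
    return total / n

approx = quasi_random_estimate(8192)
```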


Teichroew (1965) provides a historical perspective on simulation techniques. 
For further advocacy and illustration of the use of (non-Markov-chain) Monte Carlo 
methods in Bayesian Statistics, see Stewart (1979, 1983, 1985, 1987), Stewart and 
Davis (1986), Shao (1989, 1990) and Wolpert (1991). 


5.5.4 Sampling-importance-resampling 


Instead of just using importance sampling to estimate integrals —and hence calcu- 
late posterior normalising constants and moments —we can also exploit the idea 
in order to produce simulated samples from posterior or predictive distributions. 
This technique is referred to by Rubin (1988) as sampling-importance-resampling 
(SIR). 

We begin by taking a fresh look at Bayes’ theorem from this
sampling-importance-resampling perspective, shifting the focus in Bayes’ theorem
from densities to samples. Our account is based on Smith and Gelfand (1992).

As a first step, we note the essential duality between a sample and the
distribution from which it is generated: clearly, the distribution can generate the
sample; conversely, given a sample we can re-create, at least approximately, the
distribution (as a histogram, an empirical distribution function, a kernel density
estimate, or whatever). In terms of densities, Bayes’ theorem defines the inference
process as the modification of the prior density $p(\theta)$ to form the posterior
density $p(\theta \mid x)$, through the medium of the likelihood function
$p(x \mid \theta)$. Shifting to a sampling perspective, this corresponds to the
modification of a sample from $p(\theta)$ to form a sample from $p(\theta \mid x)$
through the medium of the likelihood function $p(x \mid \theta)$.

To gain insight into the general problem of how a sample from one density
may be modified to form a sample from a different density, consider the following.
Suppose that a sample of random quantities has been generated from a density
$g(\theta)$, but that what is required is a sample from the density

$$
h(\theta) = \frac{f(\theta)}{\int f(\theta)\,d\theta},
$$

where only the functional form of $f(\theta)$ is specified. Given $f(\theta)$ and the
sample from $g(\theta)$, how can we derive a sample from $h(\theta)$?

In cases where there exists an identifiable constant $M > 0$ such that

$$
f(\theta)/g(\theta) \le M, \quad \text{for all } \theta,
$$

an exact sampling procedure follows immediately from the well known rejection
method for generating random quantities (see, for example, Ripley, 1987, p. 60):

(i) consider a $\theta$ generated from $g(\theta)$;

(ii) generate $u$ from $\mathrm{Un}(u \mid 0, 1)$;

(iii) if $u \le f(\theta)/[M g(\theta)]$ accept $\theta$; otherwise repeat (i)-(iii).

Any accepted $\theta$ is then a random quantity from $h(\theta)$. Given a sample of
size $N$ for $g(\theta)$, it is immediately verified that the expected sample size
from $h(\theta)$ is $M^{-1} N \int f(x)\,dx$.
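
Steps (i)-(iii) can be sketched as below. The target kernel $f(\theta) = \theta^3(1-\theta)^7$ (a Beta(4, 8) kernel), the uniform sampling density $g$, and the names used are assumptions chosen for illustration; for this choice $f/g$ is maximised at $\theta = 0.3$, which supplies the bound $M$.

```python
import random

def f(theta):
    # Unnormalised target: kernel of a Beta(4, 8) density on (0, 1).
    return theta ** 3 * (1.0 - theta) ** 7

def g(theta):
    # Sampling density: uniform on (0, 1).
    return 1.0

# f/g attains its maximum at theta = 0.3, so f/g <= M for all theta.
M = f(0.3) / g(0.3)

def rejection_sample(n, seed=0):
    # Pass once through a sample of size n from g, keeping each theta
    # with probability f(theta) / (M * g(theta)).
    rng = random.Random(seed)
    accepted = []
    for _ in range(n):
        theta = rng.random()                      # (i) draw theta from g
        u = rng.random()                          # (ii) draw u from Un(0, 1)
        if u <= f(theta) / (M * g(theta)):        # (iii) accept or discard
            accepted.append(theta)
    return accepted

N = 20000
sample = rejection_sample(N)
acceptance_rate = len(sample) / N
sample_mean = sum(sample) / len(sample)
```

For this example $\int f(\theta)\,d\theta = B(4, 8) \approx 7.58 \times 10^{-4}$ and $M \approx 2.22 \times 10^{-3}$, so the expected accepted fraction $M^{-1}\int f$ is about 0.34, and the accepted values should have mean close to the Beta(4, 8) mean of 1/3.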

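In cases where the bound $M$ in the above is not readily available, we can
approximate samples from $h(\theta)$ as follows. Given
$\theta_1, \ldots, \theta_N$ from $g(\theta)$, calculate

$$
q_i = \frac{w_i}{\sum_{j=1}^{N} w_j}, \quad \text{where} \quad
w_i = \frac{f(\theta_i)}{g(\theta_i)}.
$$

If we now draw $\theta^*$ from the discrete distribution
$\{\theta_1, \ldots, \theta_N\}$ having mass $q_i$ on $\theta_i$, then $\theta^*$ is
approximately distributed as a random quantity from $h(\theta)$. To see this,
consider, for mathematical convenience, the univariate case. Then, under
appropriate regularity conditions, if $P$ describes the actual distribution of
$\theta^*$,

$$
P(\theta^* \le a) = \sum_{i=1}^{N} q_i\, 1_{(-\infty, a]}(\theta_i)
= \frac{N^{-1} \sum_{i=1}^{N} w_i\, 1_{(-\infty, a]}(\theta_i)}
       {N^{-1} \sum_{i=1}^{N} w_i},
$$

so that

$$
\lim_{N \to \infty} P(\theta^* \le a)
= \frac{E_g\left\{ \frac{f(\theta)}{g(\theta)}\, 1_{(-\infty, a]}(\theta) \right\}}
       {E_g\left\{ \frac{f(\theta)}{g(\theta)} \right\}}
= \frac{\int_{-\infty}^{a} f(\theta)\,d\theta}{\int f(\theta)\,d\theta}
= \int_{-\infty}^{a} h(\theta)\,d\theta .
$$

Since sampling with replacement is not ruled out, the sample size generated
in this case can be as large as desired. Clearly, however, the less $h(\theta)$
resembles $g(\theta)$ the larger $N$ will need to be if the distribution of
$\theta^*$ is to be a reasonable approximation to $h(\theta)$.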

With this sampling-importance-resampling procedure in mind, let us return
to the prior to posterior sample process defined by Bayes’ theorem. For fixed $x$,
define $f_x(\theta) = p(x \mid \theta)\,p(\theta)$. Then, if $\hat{\theta}$
maximising $p(x \mid \theta)$ is available, the rejection procedure given above can
be applied to a sample for $p(\theta)$ to obtain a sample from $p(\theta \mid x)$ by
taking $g(\theta) = p(\theta)$, $f(\theta) = f_x(\theta)$ and
$M = p(x \mid \hat{\theta})$. Bayes’ theorem then takes the simple form:

For each $\theta$ in the prior sample, accept $\theta$ into the posterior sample
with probability

$$
\frac{f_x(\theta)}{M\,p(\theta)} = \frac{p(x \mid \theta)}{p(x \mid \hat{\theta})}.
$$

The likelihood therefore acts in an intuitive way to define the resampling
probability: those $\theta$ with high likelihoods are more likely to be represented
in the posterior sample. Alternatively, if $M$ is not readily available, we can use
the approximate resampling method, which selects $\theta_i$ into the posterior
sample with probability

$$
q_i = \frac{p(x \mid \theta_i)}{\sum_{j=1}^{N} p(x \mid \theta_j)}.
$$

Again we note that this is proportional to the likelihood, so that the inference
process via sampling proceeds in an intuitive way.
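
The approximate resampling form of Bayes’ theorem can be sketched as follows, under illustrative assumptions: a uniform prior on $(0, 1)$, a binomial likelihood with 3 successes in 10 trials (so that the exact posterior is Beta(4, 8), with mean 1/3), and hypothetical function names.

```python
import random

def likelihood(theta, successes=3, trials=10):
    # Binomial likelihood kernel p(x | theta), up to a constant factor.
    return theta ** successes * (1.0 - theta) ** (trials - successes)

def resample_posterior(n_prior, n_post, seed=0):
    rng = random.Random(seed)
    # Sample from the prior p(theta) = Un(theta | 0, 1).
    prior_sample = [rng.random() for _ in range(n_prior)]
    # Resampling probabilities q_i proportional to the likelihood.
    weights = [likelihood(t) for t in prior_sample]
    total = sum(weights)
    q = [w / total for w in weights]
    # Draw, with replacement, from the discrete distribution with mass q_i on theta_i.
    return rng.choices(prior_sample, weights=q, k=n_post)

posterior_sample = resample_posterior(20000, 5000)
posterior_mean = sum(posterior_sample) / len(posterior_sample)
```

The resampled values are only approximately distributed according to the posterior; the approximation improves as the prior sample size grows relative to how sharply the likelihood is concentrated.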

The sampling-resampling perspective outlined above opens up the possibility
of novel applications of exploratory data analytic and computer graphical
techniques in Bayesian statistics. We shall not pursue these ideas further here,
since the topic is more properly dealt with in the subsequent volume Bayesian
Computation. For an illustration of the method in the context of sensitivity
analysis and intractable reference analysis, see Stephens and Smith (1992); for
pedagogical illustration, see Albert (1993).


5.5.5 Markov Chain Monte Carlo


The key idea is very simple. Suppose that we wish to generate a sample from a
posterior distribution $p(\theta \mid x)$ for $\theta \in \Theta \subseteq \Re^k$ but
cannot do this directly. However, suppose that we can construct a Markov chain with
state space $\Theta$, which is straightforward to simulate from, and whose
equilibrium distribution is $p(\theta \mid x)$. If we then run the chain for a long
time, simulated values of the chain can be used as a basis for summarising features
of the posterior $p(\theta \mid x)$ of interest. To implement this strategy, we
simply need algorithms for constructing chains with specified equilibrium
distributions. For recent accounts and discussion, see, for example, Gelfand
and Smith (1990), Casella and George (1992), Gelman and Rubin (1992a, 1992b),
Geyer (1992), Raftery and Lewis (1992), Ritter and Tanner (1992), Roberts (1992),
Tierney (1992), Besag and Green (1993), Chan (1993), Gilks et al. (1993) and
Smith and Roberts (1993); see, also, Tanner and Wong (1987) and Tanner (1991).

Under suitable regularity conditions, asymptotic results exist which clarify the
sense in which the sample output from a chain with equilibrium distribution
$p(\theta \mid x)$ can be used to mimic a random sample from $p(\theta \mid x)$ or to
estimate the expected value, with respect to $p(\theta \mid x)$, of a function
$g(\theta)$ of interest.

If $\theta^1, \theta^2, \ldots, \theta^t, \ldots$ is a realisation from an
appropriate chain, typically available asymptotic results as $t \to \infty$ include

$$
\theta^t \to \theta \sim p(\theta \mid x), \quad \text{in distribution,}
$$

and

$$
\frac{1}{t} \sum_{i=1}^{t} g(\theta^i) \to E_{\theta \mid x}\{g(\theta)\}
\quad \text{almost surely.}
$$

Clearly, successive $\theta^t$ will be correlated, so that, if the first of these
asymptotic results is to be exploited to mimic a random sample from
$p(\theta \mid x)$, suitable spacings will be required between realisations used to
form the sample, or parallel independent runs of the chain might be considered. The
second of the asymptotic results implies that ergodic averaging of a function of
interest over realisations from a single run of the chain provides a consistent
estimator of its expectation.

In what follows, we outline two particular forms of Markov chain scheme, 
which have proved particularly convenient for a range of applications in Bayesian 
statistics. 


The Gibbs Sampling Algorithm 


Suppose that $\theta$, the vector of unknown quantities appearing in Bayes’ theorem,
has components $\theta_1, \ldots, \theta_k$, and that our objective is to obtain
summary inferences from the joint posterior
$p(\theta \mid x) = p(\theta_1, \ldots, \theta_k \mid x)$. As we have already
observed in this section, except in simple, stylised cases, this will typically
lead, unavoidably, to challenging problems of numerical integration.

In fact, this apparent need for sophisticated numerical integration technology
can often be avoided by recasting the problem as one of iterative sampling of random
quantities from appropriate distributions to produce an appropriate Markov chain.
To this end, we note that

$$
p(\theta_i \mid x, \theta_j, j \ne i), \quad i = 1, \ldots, k,
$$

the so-called full conditional densities for the individual components, given the
data and specified values of all the other components of $\theta$, are typically
easily identified, as functions of $\theta_i$, by inspection of the form of
$p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)$ in any given application.
Suppose then, that given an arbitrary set of starting values,

$$
\theta_1^{(0)}, \ldots, \theta_k^{(0)},
$$

for the unknown quantities, we implement the following iterative procedure:

draw $\theta_1^{(1)}$ from $p(\theta_1 \mid x, \theta_2^{(0)}, \ldots, \theta_k^{(0)})$,
draw $\theta_2^{(1)}$ from $p(\theta_2 \mid x, \theta_1^{(1)}, \theta_3^{(0)}, \ldots, \theta_k^{(0)})$,
draw $\theta_3^{(1)}$ from $p(\theta_3 \mid x, \theta_1^{(1)}, \theta_2^{(1)}, \theta_4^{(0)}, \ldots, \theta_k^{(0)})$,
...
draw $\theta_k^{(1)}$ from $p(\theta_k \mid x, \theta_1^{(1)}, \ldots, \theta_{k-1}^{(1)})$,
draw $\theta_1^{(2)}$ from $p(\theta_1 \mid x, \theta_2^{(1)}, \ldots, \theta_k^{(1)})$,
...

and so on.
Now suppose that the above procedure is continued through $t$ iterations and
is independently replicated $m$ times so that from the current iteration we have $m$
replicates of the sampled vector
$\theta^t = (\theta_1^{(t)}, \ldots, \theta_k^{(t)})$, where $\theta^t$ is a
realisation of a Markov chain with transition probabilities given by

$$
\pi(\theta^t, \theta^{t+1}) = \prod_{l=1}^{k}
p(\theta_l^{(t+1)} \mid \theta_j^{(t)}, j > l,\; \theta_j^{(t+1)}, j < l,\; x).
$$

Then (see, for example, Geman and Geman, 1984, Roberts and Smith, 1994),
as $t \to \infty$, $(\theta_1^{(t)}, \ldots, \theta_k^{(t)})$ tends in distribution
to a random vector whose joint density is $p(\theta \mid x)$. In particular,
$\theta_i^{(t)}$ tends in distribution to a random quantity whose density is
$p(\theta_i \mid x)$. Thus, for large $t$, the replicates
$(\theta_{i1}^{(t)}, \ldots, \theta_{im}^{(t)})$ are approximately a random sample
from $p(\theta_i \mid x)$. It follows, by making $m$ suitably large, that an estimate
$\hat{p}(\theta_i \mid x)$ for $p(\theta_i \mid x)$ is easily obtained, either as a
kernel density estimate derived from
$(\theta_{i1}^{(t)}, \ldots, \theta_{im}^{(t)})$, or from

$$
\hat{p}(\theta_i \mid x) = \frac{1}{m} \sum_{l=1}^{m}
p(\theta_i \mid x, \theta_{jl}^{(t)}, j \ne i).
$$


So far as sampling from the $p(\theta_i \mid x, \theta_j, j \ne i)$ is concerned,
$i = 1, \ldots, k$, either the full conditionals assume familiar forms, in which case
computer routines are typically already available, or they are simple arbitrary
mathematical forms, in which case general stochastic simulation techniques are
available (such as envelope rejection and ratio of uniforms) which can be adapted
to the specific forms (see, for example, Devroye, 1986, Ripley, 1987, Wakefield
et al., 1991, Gilks, 1992, Gilks and Wild, 1992, and Dellaportas and Smith, 1993).
See, also, Carlin and Gelfand (1991).

The potential of this iterative scheme for routine implementation of Bayesian
analysis has been demonstrated in detail for a wide variety of problems: see, for
example, Gelfand and Smith (1990), Gelfand et al. (1990) and Gilks et al. (1993).
We shall not provide a more extensive discussion here, since illustration of the
technique in complex situations more properly belongs to the second volume of
this work. We note, however, that simulation approaches are ideally suited to
providing summary inferences (we simply report an appropriate summary of the
sample), inferences for arbitrary functions of $\theta_1, \ldots, \theta_k$ (we
simply form a sample of the appropriate function from the samples of the
$\theta_i$’s) or predictions (for example, in an obvious notation,
$p(y \mid x) = m^{-1} \sum_{i=1}^{m} p(y \mid \theta_i^{(t)})$, the average being over
the $m$ replicates $\theta_i^{(t)}$, which have an approximate $p(\theta \mid x)$
distribution for large $t$).
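
The replicated scheme described above can be sketched as follows. As an assumption made purely for illustration, the target standing in for $p(\theta \mid x)$ is a bivariate normal with zero means, unit variances and correlation $\rho$, for which the full conditionals are $N(\rho\,\theta_2, 1 - \rho^2)$ and $N(\rho\,\theta_1, 1 - \rho^2)$ (variances, not precisions); the function names and the values of $t$, $m$ and the starting point are hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(rho, t, m, start=(5.0, -5.0), seed=1):
    # Run m independent replicates of the Gibbs sampler for t iterations each
    # and return the m final vectors (theta_1, theta_2).
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)   # conditional standard deviation
    replicates = []
    for _ in range(m):
        th1, th2 = start
        for _ in range(t):
            th1 = rng.gauss(rho * th2, sd)   # draw theta_1 | theta_2
            th2 = rng.gauss(rho * th1, sd)   # draw theta_2 | updated theta_1
        replicates.append((th1, th2))
    return replicates

def summarise(draws):
    # Sample mean of theta_1 and sample correlation of (theta_1, theta_2).
    m = len(draws)
    mean1 = sum(a for a, _ in draws) / m
    mean2 = sum(b for _, b in draws) / m
    s11 = sum((a - mean1) ** 2 for a, _ in draws)
    s22 = sum((b - mean2) ** 2 for _, b in draws)
    s12 = sum((a - mean1) * (b - mean2) for a, b in draws)
    return mean1, s12 / math.sqrt(s11 * s22)

draws = gibbs_bivariate_normal(rho=0.8, t=50, m=2000)
mean1, corr = summarise(draws)
```

Summary inferences are then read off the replicates directly: here the sample mean of the first component should be near 0 and the sample correlation near 0.8, despite the deliberately poor starting values.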


The Metropolis-Hastings algorithm


This algorithm constructs a Markov chain
$\theta^1, \theta^2, \ldots, \theta^t, \ldots$ with state space $\Theta$ and
equilibrium distribution $p(\theta \mid x)$ by defining the transition probability
from $\theta^t = \theta$ to the next realised state $\theta^{t+1}$ as follows.

Let $q(\theta, \theta')$ denote a (for the moment arbitrary) transition probability
function, such that, if $\theta^t = \theta$, the vector $\theta'$ drawn from
$q(\theta, \theta')$ is considered as a proposed possible value for $\theta^{t+1}$.
However, a further randomisation now takes place. With some probability
$\alpha(\theta, \theta')$, we actually accept $\theta^{t+1} = \theta'$; otherwise, we
reject the value generated from $q(\theta, \theta')$ and set
$\theta^{t+1} = \theta$. This construction defines a Markov chain with transition
probabilities given by

$$
p(\theta, \theta') = q(\theta, \theta')\,\alpha(\theta, \theta')
+ I(\theta' = \theta) \left[ 1 - \int q(\theta, \theta'')\,
\alpha(\theta, \theta'')\,d\theta'' \right],
$$

where $I(\cdot)$ is the indicator function. If now we set

$$
\alpha(\theta, \theta') = \min\left\{ 1,\;
\frac{p(\theta' \mid x)\,q(\theta', \theta)}{p(\theta \mid x)\,q(\theta, \theta')}
\right\},
$$

it is easy to check that
$p(\theta \mid x)\,p(\theta, \theta') = p(\theta' \mid x)\,p(\theta', \theta)$,
which, provided that the thus far arbitrary $q(\theta, \theta')$ is chosen to be
irreducible and aperiodic on a suitable state space, is a sufficient condition for
$p(\theta \mid x)$ to be the equilibrium distribution of the constructed chain.

This general algorithm is due to Hastings (1970); see, also, Metropolis et
al. (1953), Peskun (1973), Tierney (1992), Besag and Green (1993), Roberts and
Smith (1994) and Smith and Roberts (1993). It is important to note that the
(equilibrium) distribution of interest, $p(\theta \mid x)$, only enters
$p(\theta, \theta')$ through the ratio $p(\theta' \mid x)/p(\theta \mid x)$. This is
quite crucial since it means that knowledge of the distribution up to
proportionality (given by the likelihood multiplied by the prior) is
sufficient for implementation.
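
A minimal sketch of the algorithm follows, for the special case of a symmetric random-walk proposal, where $q(\theta, \theta') = q(\theta', \theta)$ and the proposal terms cancel in the acceptance probability (the original Metropolis form). The target is assumed, for illustration, to be known only up to proportionality as $\theta^3(1-\theta)^7$ on $(0, 1)$, the kernel of a Beta(4, 8) posterior; the step size, chain length, burn-in and function names are hypothetical choices.

```python
import math
import random

def log_target(theta):
    # Log of the unnormalised posterior: likelihood times prior, up to a constant.
    if not 0.0 < theta < 1.0:
        return -math.inf
    return 3.0 * math.log(theta) + 7.0 * math.log(1.0 - theta)

def metropolis_hastings(n_iter, step=0.3, start=0.5, seed=0):
    rng = random.Random(seed)
    theta = start
    chain = []
    for _ in range(n_iter):
        # Propose theta' from a symmetric normal random walk centred at theta.
        proposal = rng.gauss(theta, step)
        # Only the ratio p(theta' | x) / p(theta | x) is needed.
        log_ratio = log_target(proposal) - log_target(theta)
        # Accept with probability min{1, ratio}; otherwise stay at theta.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        chain.append(theta)
    return chain

chain = metropolis_hastings(50000)
kept = chain[1000:]                      # discard an initial burn-in segment
chain_mean = sum(kept) / len(kept)       # ergodic average estimating E(theta | x)
```

Proposals falling outside $(0, 1)$ have zero target density and are therefore always rejected, so the chain never leaves the support. With an asymmetric proposal the factor $q(\theta', \theta)/q(\theta, \theta')$ would have to be included in the ratio.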


5.6 DISCUSSION AND FURTHER REFERENCES 


5.6.1 An Historical Footnote


Blackwell (1988) gave a very elegant demonstration of the way in which a simple 
finite additivity argument can be used to give powerful insight into the relation 
between frequency and belief probability. The calculation involved has added 
interest in that— according to Stigler (1982)—it might very well have been made 
by Bayes himself. 


The argument goes as follows. Suppose that 0-1 observables
$x_1, \ldots, x_{n+1}$ are finitely exchangeable. We observe
$x = (x_1, \ldots, x_n)$ and wish to evaluate

$$
\frac{P(x_{n+1} = 1 \mid x)}{P(x_{n+1} = 0 \mid x)}.
$$

Writing $s = x_1 + \cdots + x_n$ and $p(t) = P(x_1 + \cdots + x_{n+1} = t)$, this
ratio, by virtue of exchangeability, is easily seen to be equal to

$$
\frac{p(s+1) \big/ \binom{n+1}{s+1}}{p(s) \big/ \binom{n+1}{s}}
= \frac{s+1}{n-s+1} \cdot \frac{p(s+1)}{p(s)} \approx \frac{s}{n-s},
$$

if $p(s) \approx p(s+1)$ and $s$ and $n - s$ are not too small.

This can be interpreted as follows. If, before observing $x$, we considered $s$
and $s + 1$ to be about equally plausible as values for $x_1 + \cdots + x_{n+1}$,
the resulting posterior odds for $x_{n+1} = 1$ will be essentially the frequency
odds based on the first $n$ trials.

Inverting the argument, we see that if one wants to have this “convergence” 
of beliefs and frequencies it is necessary that p(s) ~ p(s + 1). But what does this 
entail? 

Reverting to an infinite exchangeability assumption, and hence the familiar
binomial framework, suppose we require that $p(\theta)$ be chosen such that

$$
p(s) = \int_0^1 \binom{n}{s}\,\theta^{s} (1 - \theta)^{n-s}\,p(\theta)\,d\theta
$$

does not depend on $s$. An easy calculation shows that this is satisfied if
$p(\theta)$ is taken to be uniform on $(0, 1)$, the so-called Bayes (or
Bayes-Laplace) Postulate.
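
The “easy calculation” can be checked exactly with rational arithmetic. In the sketch below (function names are hypothetical), the Beta integral $\int_0^1 \theta^a (1-\theta)^b\,d\theta = a!\,b!/(a+b+1)!$ for non-negative integers $a, b$ is used to evaluate $p(s)$ under a uniform prior, and the exchangeability ratio for the posterior odds of a further success is then computed from the resulting $p(\cdot)$.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    # Exact integral of theta^a (1 - theta)^b over (0, 1) for integers a, b >= 0.
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def p_s(s, n):
    # p(s) under a uniform prior on theta: binomial probability integrated over theta.
    return comb(n, s) * beta_integral(s, n - s)

def posterior_odds(s, n):
    # Odds for a success on trial n + 1 after s successes in n trials,
    # using p(t) for the total number of successes in n + 1 trials.
    numerator = p_s(s + 1, n + 1) / comb(n + 1, s + 1)
    denominator = p_s(s, n + 1) / comb(n + 1, s)
    return numerator / denominator

uniform_check = [p_s(s, 10) for s in range(11)]
odds = posterior_odds(3, 10)
```

Under the uniform prior every $p(s)$ equals $1/(n+1)$, so the odds reduce exactly to $(s+1)/(n-s+1)$, which is close to the frequency odds $s/(n-s)$ when $s$ and $n - s$ are not too small.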

Stigler (1982) has argued that an argument like the above could have been 
Bayes’ motivation for the adoption of this uniform prior. 


5.6.2 Prior ignorance 


To many attracted to the formalism of the Bayesian inferential paradigm, the idea 
of a non-informative prior distribution, representing “ignorance” and “letting the 
data speak for themselves” has proved extremely seductive, often being regarded 
as synonymous with providing objective inferences. It will be clear from the gen- 
eral subjective perspective we have maintained throughout this volume, that we 
regard this search for “objectivity” to be misguided. However, it will also be clear 
from our detailed development in Section 5.4 that we recognise the rather special 
nature and role of the concept of a “minimally informative” prior specification 
— appropriately defined! In any case, the considerable body of conceptual and the- 
oretical literature devoted to identifying “appropriate” procedures for formulating 
prior representations of “ignorance” constitutes a fascinating chapter in the history 
of Bayesian Statistics. In this section we shall provide an overview of some of the 
main directions followed in this search for a Bayesian “Holy Grail”. 

In the early works by Bayes (1763) and Laplace (1814/1952), the definition of 
a non-informative prior is based on what has now become known as the principle of 
insufficient reason, or the Bayes-Laplace postulate (see Section 5.6.1). According 
to this principle, in the absence of evidence to the contrary, all possibilities should 
have the same initial probability. This is closely related to the so-called Laplace- 
Bertrand paradox; see Jaynes (1971) for an interesting Bayesian resolution. 

In particular, if an unknown quantity, $\phi$, say, can only take a finite number
of values, $M$, say, the non-informative prior suggested by the principle is the
discrete uniform distribution $p(\phi) = \{1/M, \ldots, 1/M\}$. This may, at first
sight, seem intuitively reasonable, but Example 5.16 showed that even in simple,
finite, discrete cases care can be required in appropriately defining the unknown
quantity of interest. Moreover, in countably infinite, discrete cases the uniform
(now improper) prior is known to produce unappealing results. Jeffreys (1939/1961,
p. 238) suggested, for the case of the integers, the prior

$$
\pi(n) \propto n^{-1}, \quad n = 1, 2, \ldots
$$

More recently, Rissanen (1983) used a coding theory argument to motivate the prior

$$
\pi(n) \propto \frac{1}{n} \times \frac{1}{\log n} \times
\frac{1}{\log\log n} \times \cdots, \quad n = 1, 2, \ldots
$$

However, as indicated in Example 5.23, embedding the discrete problem within a
continuous framework and subsequently discretising the resulting reference prior
for the continuous case may produce better results.

If the space, $\Phi$, of $\phi$ values is a continuum (say, the real line) the
principle of insufficient reason has been interpreted as requiring a uniform
distribution over $\Phi$. However, a uniform distribution for $\phi$ implies a
non-uniform distribution for any non-linear monotone transformation of $\phi$ and
thus the Bayes-Laplace postulate is inconsistent in the sense that, intuitively,
“ignorance about $\phi$” should surely imply “equal ignorance” about a one-to-one
transformation of $\phi$. Specifically, if some procedure yields $p(\phi)$ as a
non-informative prior for $\phi$ and the same procedure yields $p(\zeta)$ as a
non-informative prior for a one-to-one transformation $\zeta = \zeta(\phi)$ of
$\phi$, consistency would seem to demand that $p(\zeta)\,d\zeta = p(\phi)\,d\phi$;
thus, a procedure for obtaining the “ignorance” prior should presumably be
invariant under one-to-one reparametrisation.

Based on these invariance considerations, Jeffreys (1946) proposed as a
non-informative prior, with respect to an experiment
$e = \{X, \phi, p(x \mid \phi)\}$, involving a parametric model which depends on a
single parameter $\phi$, the (often improper) density

$$
\pi(\phi) \propto h(\phi)^{1/2},
$$

where

$$
h(\phi) = -\int_X p(x \mid \phi)\,
\frac{\partial^2}{\partial \phi^2} \log p(x \mid \phi)\,dx .
$$

In effect, Jeffreys noted that the logarithmic divergence locally behaves like
the square of a distance, determined by a Riemannian metric, whose natural length
element is $h(\phi)^{1/2}$, and that natural length elements of Riemannian metrics
are invariant to reparametrisation. In an illuminating paper, Kass (1989) elaborated
on this geometrical interpretation by arguing that, more generally, natural volume
elements generate “uniform” measures on manifolds, in the sense that equal mass
is assigned to regions of equal volume, the essential property that makes Lebesgue
measure intuitively appealing.
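
As a worked illustration (our own choice of example, with the integral over $X$ replaced by a sum because the observation is discrete), consider a single Bernoulli observation with $p(x \mid \phi) = \phi^{x}(1-\phi)^{1-x}$, $x \in \{0, 1\}$. The derivation computes $h(\phi)$ and the resulting prior, and then checks the invariance requirement under the assumed reparametrisation $\zeta = \arcsin\sqrt{\phi}$.

```latex
\begin{align*}
\log p(x \mid \phi) &= x \log\phi + (1-x)\log(1-\phi), \\
\frac{\partial^2}{\partial\phi^2}\log p(x \mid \phi)
  &= -\frac{x}{\phi^2} - \frac{1-x}{(1-\phi)^2}, \\
h(\phi) &= \frac{E[x \mid \phi]}{\phi^2} + \frac{1 - E[x \mid \phi]}{(1-\phi)^2}
         = \frac{1}{\phi} + \frac{1}{1-\phi}
         = \frac{1}{\phi(1-\phi)}, \\
\pi(\phi) &\propto h(\phi)^{1/2} = \phi^{-1/2}(1-\phi)^{-1/2}.
\end{align*}
% Reparametrise: \zeta = \arcsin\sqrt{\phi}, so that \phi = \sin^2\zeta
% and d\phi/d\zeta = 2\sin\zeta\cos\zeta on 0 < \zeta < \pi/2.
\begin{align*}
h(\zeta) &= h(\phi)\left(\frac{d\phi}{d\zeta}\right)^{2}
          = \frac{4\sin^2\zeta\cos^2\zeta}{\sin^2\zeta\cos^2\zeta} = 4, \\
\pi(\zeta) &\propto h(\zeta)^{1/2} = 2 \quad \text{(uniform in } \zeta\text{)}, \\
\pi(\phi)\left|\frac{d\phi}{d\zeta}\right|
  &\propto \frac{2\sin\zeta\cos\zeta}{\sin\zeta\cos\zeta} = 2 .
\end{align*}
```

The last two lines agree: applying the rule directly in the $\zeta$ parametrisation gives the same (uniform) prior as transforming $\pi(\phi)$ by the change-of-variable formula, which is the consistency requirement $p(\zeta)\,d\zeta = p(\phi)\,d\phi$.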

In his work, Jeffreys explored the implications of such a non-informative prior
for a large number of inference problems. He found that his rule (by definition
restricted to a continuous parameter) works well in the one-dimensional case, but
can lead to unappealing results (Jeffreys, 1939/1961, p. 182) when one tries to
extend it to multiparameter situations.

The procedure proposed by Jeffreys’ preferred rule was rather ad hoc, in that 
there are many other procedures (some of which he described) which exhibit the 
required type of invariance. His intuition as to what is required, however, was 
rather good. Jeffreys’ solution for the one-dimensional continuous case has been 
widely adopted, and a number of alternative justifications of the procedure have 
been provided. 

Perks (1947) used an argument based on the asymptotic size of confidence
regions to propose a non-informative prior of the form

$$
\pi(\phi) \propto s(\phi)^{-1},
$$

where $s(\phi)$ is the asymptotic standard deviation of the maximum likelihood
estimate of $\phi$. Under regularity conditions which imply asymptotic normality,
this turns out to be equivalent to Jeffreys’ rule.

Lindley (1961b) argued that, in practice, one can always replace a continuous
range of $\phi$ by discrete values over a grid whose mesh size, $\delta(\phi)$, say,
describes the precision of the measuring process, and that a possible operational
interpretation of “ignorance” is a probability distribution which assigns equal
probability to all points of this grid. In the continuous case, this implies a
prior proportional to $\delta(\phi)^{-1}$. To determine $\delta(\phi)$ in the
context of an experiment $e = \{X, \phi, p(x \mid \phi)\}$, Lindley showed that if
the quantity can only take the values $\phi$ or $\phi + \delta(\phi)$, the amount of
information that $e$ may be expected to provide about $\phi$, if
$p(\phi) = p(\phi + \delta(\phi)) = \frac{1}{2}$, is $2\delta^2(\phi)\,h(\phi)$.
This expected information will be independent of $\phi$ if
$\delta(\phi) \propto h(\phi)^{-1/2}$, thus defining an appropriate mesh; arguing as
before, this suggests Jeffreys’ prior $\pi(\phi) \propto h(\phi)^{1/2}$. Akaike
(1978a) used a related argument to justify Jeffreys’ prior as “locally impartial”.

Welch and Peers (1963) and Welch (1965) discussed conditions under which
there is formal mathematical equivalence between one-dimensional Bayesian credible
regions and corresponding frequentist confidence intervals. They showed that,
under suitable regularity assumptions, one-sided intervals asymptotically coincide
if the prior used for the Bayesian analysis is Jeffreys’ prior. Peers (1965) later
showed that the argument does not extend to several dimensions. Hartigan (1966b)
and Peers (1968) discuss two-sided intervals. Tibshirani (1989), Mukerjee and Dey
(1993) and Nicolau (1993) extend the analysis to the case where there are nuisance
parameters.

Hartigan (1965) reported that the prior density which minimises the bias of
the estimator $d$ of $\phi$ associated with the loss function $l(d, \phi)$ is

$$
\pi(\phi) = h(\phi) \left[ \frac{\partial^2}{\partial d^2}\,
l(d, \phi) \right]^{-1/2}_{d = \phi} .
$$

If, in particular, one uses the discrepancy measure

$$
l(d, \phi) = \int p(x \mid \phi) \log \frac{p(x \mid \phi)}{p(x \mid d)}\,dx
$$

as a natural loss function (see Definition 3.15), this implies that
$\pi(\phi) = h(\phi)^{1/2}$, which is, again, Jeffreys’ prior.

Good (1969) derived Jeffreys’ prior as the “least favourable” initial distribution
with respect to a logarithmic scoring rule, in the sense that it minimises the
expected score from reporting the true distribution. Since the logarithmic score is
proper, and hence is maximised by reporting the true distribution, Jeffreys’ prior
may technically be described, under suitable regularity conditions, as a minimax
solution to the problem of scientific reporting when the utility function is the
logarithmic score function. Kashyap (1971) provided a similar, more detailed
argument; an axiom system is used to justify the use of an information measure as a
payoff function and Jeffreys’ prior is shown to be a minimax solution in a
two-person zero-sum game, where the statistician chooses the “non-informative”
prior and nature chooses the “true” prior.

Hartigan (1971, 1983, Chapter 5) defines a similarity measure for events $E$, $F$
to be $P(E \cap F)/P(E)P(F)$ and shows that Jeffreys’ prior ensures, asymptotically,
constant similarity for current and future observations.

Following Jeffreys (1955), Box and Tiao (1973, Section 1.3) argued for selecting
a prior by convention to be used as a standard of reference. They suggested
that the principle of insufficient reason may be sensible in location problems, and
proposed as a conventional prior $\pi(\phi)$ for a model parameter $\phi$ that
$\pi(\phi)$ which implies a uniform prior

$$
\pi(\zeta) \propto c
$$

for a function $\zeta = \zeta(\phi)$ such that $p(x \mid \zeta)$ is, at least
approximately, a location parameter family; that is, such that, for some functions
$g$ and $f$,

$$
p(x \mid \phi) \sim g[\zeta(\phi) - f(x)].
$$

Using standard asymptotic theory, they showed that, under suitable regularity
conditions and for large samples, this will happen if

$$
\zeta(\phi) = \int^{\phi} h(\phi)^{1/2}\,d\phi ,
$$

i.e., if the non-informative prior is Jeffreys’ prior. For a recent reconsideration
and elaboration of these ideas, see Kass (1990), who extends the analysis by
conditioning on an ancillary statistic.

Unfortunately, although many of the arguments summarised above generalise
to the multiparameter continuous case, leading to the so-called multivariate
Jeffreys’ rule

$$
\pi(\theta) \propto |H(\theta)|^{1/2},
$$

where

$$
[H(\theta)]_{ij} = -\int p(x \mid \theta)\,
\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \log p(x \mid \theta)\,dx
$$

is Fisher’s information matrix, the results thus obtained typically have intuitively
unappealing implications. An example of this, pointed out by Jeffreys himself
(Jeffreys, 1939/1961, p. 182), is provided by the simple location-scale problem,
where the multivariate rule leads to $\pi(\theta, \sigma) \propto \sigma^{-2}$,
where $\theta$ is the location and $\sigma$ the scale parameter. See, also, Stein
(1962).
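
For the normal location-scale case the calculation can be written out explicitly. The following is a sketch for a single observation from a normal density with location $\theta$ and standard deviation $\sigma$ (the normal form is our choice of illustration; the sample size only multiplies the matrix by a constant).

```latex
\begin{align*}
\log p(x \mid \theta, \sigma)
  &= -\log\sigma - \frac{(x-\theta)^2}{2\sigma^2} + \text{const}, \\
\frac{\partial^2 \log p}{\partial\theta^2} &= -\frac{1}{\sigma^2}, \qquad
\frac{\partial^2 \log p}{\partial\theta\,\partial\sigma}
  = -\frac{2(x-\theta)}{\sigma^3}, \qquad
\frac{\partial^2 \log p}{\partial\sigma^2}
  = \frac{1}{\sigma^2} - \frac{3(x-\theta)^2}{\sigma^4}, \\
H(\theta, \sigma) &=
  \begin{pmatrix} \sigma^{-2} & 0 \\ 0 & 2\sigma^{-2} \end{pmatrix},
  \qquad \text{using } E(x-\theta) = 0,\; E(x-\theta)^2 = \sigma^2, \\
\pi(\theta, \sigma) &\propto |H(\theta, \sigma)|^{1/2}
  = \sqrt{2}\,\sigma^{-2} \propto \sigma^{-2}.
\end{align*}
```

By contrast, applying the one-dimensional rule to each parameter with the other held fixed gives $\pi(\theta) \propto (\sigma^{-2})^{1/2}$, constant in $\theta$, and $\pi(\sigma) \propto (2\sigma^{-2})^{1/2} \propto \sigma^{-1}$, whose product is $\sigma^{-1}$ rather than $\sigma^{-2}$.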


Example 5.25. (Univariate normal model). Let $\{x_1, \ldots, x_n\}$ be a random
sample from $N(x \mid \mu, \lambda)$, and consider $\sigma = \lambda^{-1/2}$, the
(unknown) standard deviation. In the case of known mean, $\mu = 0$, say, the
appropriate (univariate) Jeffreys’ prior is $\pi(\sigma) \propto \sigma^{-1}$ and
the posterior distribution of $\sigma$ would be such that
$[\sum_{i=1}^{n} x_i^2]/\sigma^2$ is $\chi^2_n$. In the case of unknown mean, if we
used the multivariate Jeffreys’ prior $\pi(\mu, \sigma) \propto \sigma^{-2}$ the
posterior distribution of $\sigma$ would be such that
$[\sum_{i=1}^{n} (x_i - \bar{x})^2]/\sigma^2$ is, again, $\chi^2_n$. This is widely
recognised as unacceptable, in that one does not lose any degrees of freedom even
though one has lost the knowledge that $\mu = 0$, and conflicts with the use of the
widely adopted reference prior $\pi(\mu, \sigma) = \sigma^{-1}$ (see Example 5.17 in
Section 5.4), which implies that $[\sum_{i=1}^{n} (x_i - \bar{x})^2]/\sigma^2$ is
$\chi^2_{n-1}$.


The kind of problem exemplified above led Jeffreys to the ad hoc recommendation,
widely adopted in the literature, of independent a priori treatment of location
and scale parameters, applying his rule separately to each of the two subgroups of
parameters, and then multiplying the resulting forms together to arrive at the
overall prior specification. For an illustration of this, see Geisser and Cornfield
(1963); for an elaboration of the idea, see Zellner (1986a).

At this point, one may wonder just what has become of the intuition motivating
the arguments outlined above. Unfortunately, although the implied information
limits are mathematically well-defined in one dimension, in higher dimensions the
forms obtained may depend on the path followed to obtain the limit. Similar
problems arise with other intuitively appealing desiderata. For example, the Box and
Tiao suggestion of a uniform prior following transformation to a parametrisation
ensuring data translation generalises, in the multiparameter setting, to the
requirement of uniformity following a transformation which ensures that credible
regions are of the same size. The problem, of course, is that, in several
dimensions, such regions can be of the same size but very different in form.

Jeffreys’ original requirement of invariance under reparametrisation remains 
perhaps the most intuitively convincing. If thisis conceded, it follows that, whatever 
their apparent motivating intuition. approaches which do not have this property 
should be regarded as unsatisfactory. Such approaches include the use of limiting 
forms of conjugate priors, as in Haldane (1948), Novick and Hall (1965). Novick 
(1969), DeGroot (1970. Chapter 10) and Piccinato (1973, 1977). a predictivistic 
version of the principle of insufficient reason, Geisser (1984), and different forms 
of information-theoretical arguments, such as those put forward by Zellner (1977, 
1991), Geisser (1979) and Torgesen (1981). 

Maximising the expected information (as opposed to maximising the expected 
missing information) gives invariant, but unappealing results, producing priors that
can have finite support (Berger et al., 1989). Other information-based suggestions
are those of Eaton (1982), Spall and Hill (1990) and Rodriguez (1991). 

Partially satisfactory results have nevertheless been obtained in multiparameter 
problems where the parameter space can be considered as a group of transformations 
of the sample space. Invariance considerations within such a group suggest the use 
of relatively invariant (Hartigan, 1964) priors like the Haar measures. This idea was 
pioneered by Barnard (1952). Stone (1965) recognised that, in an appropriate sense, 
it should be possible to approximate the results obtained using a non-informative 
prior by those obtained using a convenient sequence of proper priors. He went on 
to show that, if a group structure is present, the corresponding right Haar measure 
is the only prior for which such a desirable convergence is obtained. It is reassuring 
that, in those one-dimensional problems for which a group of transformations does 
exist, the right Haar measure coincides with the relevant Jeffreys’ prior. For some
undesirable consequences of the left Haar measure see Bernardo (1978b). Further
developments involving Haar measures are provided by Zidek (1969), Villegas
(1969, 1971, 1977a, 1977b, 1981), Stone (1970), Florens (1978, 1982), Chang and
Villegas (1986) and Chang and Eaves (1990). Dawid (1983b) provides an excellent
review of work up to the early 1980s. However, a large group of interesting models
do not have any group structure, so that these arguments cannot produce general 
solutions. 

Even when the parameter space may be considered as a group of transforma- 
tions there is no definitive answer. In such situations, the right Haar measures are 
the obvious choices and yet even these are open to criticism. 


Example 5.26. (Standardised mean). Let $x = \{x_1, \ldots, x_n\}$ be a random sample
from a normal distribution $N(x \mid \mu, \lambda)$. The standard prior recommended by group invariance
arguments is $\pi(\mu, \sigma) = \sigma^{-1}$, where $\lambda = \sigma^{-2}$. Although this gives adequate results if one
wants to make inferences about either $\mu$ or $\sigma$, it is quite unsatisfactory if inferences about the
standardised mean $\phi = \mu/\sigma$ are required. Stone and Dawid (1972) show that the posterior
distribution of $\phi$ obtained from such a prior depends on the data through the statistic

\[ t = \frac{\sum_{i=1}^{n} x_i}{\left(\sum_{i=1}^{n} x_i^2\right)^{1/2}}, \]

whose sampling distribution,

\[ p(t \mid \mu, \sigma) = p(t \mid \phi)
   = e^{-n\phi^2/2}\left(1 - \frac{t^2}{n}\right)^{(n-3)/2}
     \int_0^\infty \omega^{n-1} \exp\left\{-\frac{\omega^2}{2} + t\phi\omega\right\} d\omega, \]

only depends on $\phi$. One would, therefore, expect to be able to “match” the original inferences
about $\phi$ by the use of $p(t \mid \phi)$ together with some appropriate prior for $\phi$. However, no such
prior exists.
On the other hand, the reference prior relative to the ordered partition $(\phi, \sigma)$ is (see
Example 5.18)

\[ \pi(\phi, \sigma) = (2 + \phi^2)^{-1/2}\sigma^{-1} \]

and the corresponding posterior distribution for $\phi$ is

\[ \pi(\phi \mid x) \propto (2 + \phi^2)^{-1/2}
   \left[ e^{-n\phi^2/2} \int_0^\infty \omega^{n-1} \exp\left\{-\frac{\omega^2}{2} + t\phi\omega\right\} d\omega \right]. \]

We observe that the factor in square brackets is proportional to $p(t \mid \phi)$ and thus the
inconsistency disappears.


This type of marginalisation paradox, further explored by Dawid, Stone and 
Zidek (1973), appears in a large number of multivariate problems and makes it 
difficult to believe that, for any given model, a single prior may be usefully regarded 
as “universally” non-informative. Jaynes (1980) disagrees.

An acceptable general theory for non-informative priors should be able to 
provide consistent answers to the same inference problem whenever this is posed 
in different, but equivalent forms. Although this idea has failed to produce a 
constructive procedure for deriving priors, it may be used to discard those methods 
which fail to satisfy this rather intuitive requirement. 


Example 5.27. (Correlation coefficient). Let $(x, y) = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a
random sample from a bivariate normal distribution, and suppose that inferences about the
correlation coefficient $\rho$ are required. It may be shown that if the prior is of the form

\[ \pi(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \pi(\rho)\,\sigma_1^{-a}\sigma_2^{-a}, \]

which includes all proposed “non-informative” priors for this model that we are aware of,
then the posterior distribution of $\rho$ is given by

\[ \pi(\rho \mid x, y) = \pi(\rho \mid r)
   = \frac{\pi(\rho)\,(1 - \rho^2)^{(n+2a-3)/2}}{(1 - \rho r)^{n+a-5/2}}\,
     F\left(\tfrac{1}{2}, \tfrac{1}{2}, n + a - \tfrac{3}{2}, \tfrac{1 + \rho r}{2}\right), \]

where

\[ r = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}
            {\left[\sum_i (x_i - \bar{x})^2\right]^{1/2}\left[\sum_i (y_i - \bar{y})^2\right]^{1/2}} \]

is the sample correlation coefficient, and $F$ is the hypergeometric function. This posterior
distribution only depends on the data through the sample correlation coefficient $r$; thus, with
this form of prior, $r$ is sufficient. On the other hand, the sampling distribution of $r$ is

\[ p(r \mid \mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = p(r \mid \rho)
   = \frac{(1 - \rho^2)^{(n-1)/2}(1 - r^2)^{(n-4)/2}}{(1 - \rho r)^{n-3/2}}\,
     F\left(\tfrac{1}{2}, \tfrac{1}{2}, n - \tfrac{1}{2}, \tfrac{1 + \rho r}{2}\right). \]

Moreover, using the transformations $\delta = \tanh^{-1}\rho$ and $t = \tanh^{-1} r$, Jeffreys’ prior for this
univariate model is found to be $\pi(\rho) \propto (1 - \rho^2)^{-1}$ (see Lindley, 1965, pp. 215-219).

Hence one would expect to be able to match, using this reduced model, the posterior
distribution $\pi(\rho \mid r)$ given previously, so that

\[ \pi(\rho \mid r) \propto p(r \mid \rho)\,(1 - \rho^2)^{-1}. \]

Comparison between $\pi(\rho \mid r)$ and $p(r \mid \rho)$ shows that this is possible if and only if $a = 1$,
and $\pi(\rho) = (1 - \rho^2)^{-1}$. Hence, to avoid inconsistency the joint reference prior must be of
the form

\[ \pi(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = (1 - \rho^2)^{-1}\sigma_1^{-1}\sigma_2^{-1}, \]

which is precisely (see Example 5.22, p. 337) the reference prior relative to the natural order,
$\{\rho, \mu_1, \mu_2, \sigma_1, \sigma_2\}$.
However, it is easily checked that Jeffreys’ multivariate prior is

\[ \pi(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = (1 - \rho^2)^{-3/2}\sigma_1^{-2}\sigma_2^{-2} \]

and that the “two-step” Jeffreys’ multivariate prior which separates the location and scale
parameters is

\[ \pi(\mu_1, \mu_2)\,\pi(\sigma_1, \sigma_2, \rho) = (1 - \rho^2)^{-3/2}\sigma_1^{-1}\sigma_2^{-1}. \]

For further detailed discussion of this example, see Bayarri (1981).


Once again, this example suggests that different non-informative priors may 
be appropriate depending on the particular function of interest or, more generally,
on the ordering of the parameters. 

Although marginalisation paradoxes disappear when one uses proper priors, to 
use proper approximations to non-informative priors as an approximate description 
of “ignorance” does not solve the problem either. 




Example 5.28. (Stein’s paradox). Let $x = \{x_1, \ldots, x_n\}$ be a random sample from
a multivariate normal distribution $N_k(x \mid \mu, I_k)$. Let $\bar{x}_i$ be the mean of the $n$ observations
from coordinate $i$ and let $t = \sum_i \bar{x}_i^2$. The universally recommended “non-informative” prior
for this model is $\pi(\mu_1, \ldots, \mu_k) = 1$, which may be approximated by the proper density

\[ \pi(\mu_1, \ldots, \mu_k) = \prod_{i=1}^{k} N(\mu_i \mid 0, \lambda), \]

where $\lambda$ is very small. However, if inferences about $\phi = \sum_i \mu_i^2$ are desired, the use of this
prior overwhelms, for large $k$, what the data have to say about $\phi$. Indeed, with such a prior
the posterior distribution of $n\phi$ is a non-central $\chi^2$ distribution with $k$ degrees of freedom
and non-centrality parameter $nt$, so that

\[ E[\phi \mid x] = t + \frac{k}{n}, \qquad V[\phi \mid x] = \frac{2}{n}\left[2t + \frac{k}{n}\right], \]

while the sampling distribution of $nt$ is a non-central $\chi^2$ distribution with $k$ degrees of
freedom and parameter $n\phi$, so that $E[t \mid \phi] = \phi + k/n$. Thus, with, say, $k = 100$, $n = 1$ and
$t = 200$, we have $E[\phi \mid x] = 300$, $V[\phi \mid x] \approx 32^2$, whereas the unbiased estimator based on
the sampling distribution gives $\hat{\phi} = t - k = 100$.

However, the asymptotic posterior distribution of $\phi$ is $N(\phi \mid \hat{\phi}, (4\hat{\phi})^{-1})$ and hence, by
Proposition 5.2, the reference posterior for $\phi$ relative to $p(t \mid \phi)$ is

\[ \pi(\phi \mid x) \propto \pi(\phi)\,p(t \mid \phi) \propto \phi^{-1/2}\chi^2(nt \mid k, n\phi), \]

whose mode is close to $\hat{\phi}$. It may be shown that this is also the posterior distribution of $\phi$
derived from the reference prior relative to the ordered partition $\{\phi, \omega_1, \ldots, \omega_{k-1}\}$, obtained
by reparametrising to polar coordinates in the full model. For further details, see Stein
(1959), Efron (1973), Bernardo (1979b) and Ferrándiz (1982).
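
The arithmetic quoted in Example 5.28 can be checked directly. The sketch below is an illustration only: the function name is hypothetical, and it simply uses the standard mean and variance of a non-central chi-squared variable with k degrees of freedom and non-centrality delta (k + delta and 2(k + 2 delta) respectively), rescaled by n.

```python
import math

def flat_prior_moments(t, k, n):
    """Posterior mean and variance of phi when n*phi | x is non-central
    chi-squared with k degrees of freedom and non-centrality n*t."""
    delta = n * t
    mean = (k + delta) / n                  # E[phi | x] = t + k/n
    var = 2.0 * (k + 2.0 * delta) / n**2    # V[phi | x] = (2/n)(2t + k/n)
    return mean, var

k, n, t = 100, 1, 200.0
mean, var = flat_prior_moments(t, k, n)
phi_hat = t - k / n   # unbiased estimator implied by E[t | phi] = phi + k/n

print(mean, math.sqrt(var), phi_hat)
```

With these inputs the flat-prior posterior is centred at 300 with a standard deviation of about 31.6, roughly six posterior standard deviations away from the unbiased estimate of 100.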


Naive use of apparently “non-informative” prior distributions can lead to pos- 
terior distributions whose corresponding credible regions have untenable coverage 
probabilities, in the sense that, for some region $C$, the corresponding posterior
probabilities $P(C \mid z)$ may be completely different from the conditional values
$P(C \mid \theta)$ for almost all $\theta$ values.

Such a phenomenon is often referred to as strong inconsistency (see, for ex- 
ample, Stone, 1976). However, by carefully distinguishing between parameters of 
interest and nuisance parameters, reference analysis avoids this type of inconsis- 
tency. An illuminating example is provided by the reanalysis by Bernardo (1979b, 
reply to the discussion) of Stone’s (1976) Flatland example. For further discussion 
of strong inconsistency and related topics, see Appendix B, Section 3.2. 

Jaynes (1968) introduced a more general formulation of the problem. He 
allowed for the existence of a certain amount of initial “objective” information and 
then tried to determine a prior which reflected this initial information, but nothing 
else (see, also, Csiszár, 1985). Jaynes considered the entropy of a distribution to be
the appropriate measure of uncertainty subject to any “objective” information one
might have. If no such information exists and $\phi$ can only take a finite number of
values, Jaynes’ maximum entropy solution reduces to the Bayes-Laplace postulate.
His arguments are quite convincing in the finite case; however, if $\phi$ is continuous,
the non-invariant entropy functional, $H\{p(\phi)\} = -\int p(\phi)\log p(\phi)\,d\phi$, no longer
has a sensible interpretation in terms of uncertainty. Jaynes’ solution is to introduce
a “reference” density $\pi(\phi)$ in order to define an “invariantised” entropy,

\[ -\int p(\phi)\log\frac{p(\phi)}{\pi(\phi)}\,d\phi, \]

and to use the prior which maximises this expression, subject, again, to any initial
“objective” information one might have. Unfortunately, $\pi(\phi)$ must itself be a
representation of ignorance about $\phi$ so that no progress has been made. If a convenient
group of transformations is present, Jaynes suggests invariance arguments to select
the reference density. However, no general procedure is proposed.
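
The lack of invariance of the continuous entropy functional, and the way a reference density repairs it, can be seen in a small numerical sketch. Everything specific here is an assumption chosen for illustration: the density p(phi) = 2 phi on (0, 1), the uniform reference density, the reparametrisation psi = phi squared, and the midpoint-rule integrator.

```python
import math

N = 100_000  # number of midpoint-rule cells on (0, 1)

def midpoint(f):
    """Midpoint-rule approximation to the integral of f over (0, 1)."""
    return sum(f((i + 0.5) / N) for i in range(N)) / N

# Original parametrisation: p(phi) = 2*phi, reference density pi(phi) = 1.
p_phi = lambda u: 2.0 * u
pi_phi = lambda u: 1.0

# Reparametrise to psi = phi**2: densities pick up |d phi / d psi| = 1/(2 sqrt(psi)).
jac = lambda v: 1.0 / (2.0 * math.sqrt(v))
p_psi = lambda v: p_phi(math.sqrt(v)) * jac(v)
pi_psi = lambda v: pi_phi(math.sqrt(v)) * jac(v)

def entropy(p):
    """Plain entropy functional, -integral of p log p."""
    return -midpoint(lambda z: p(z) * math.log(p(z)))

def invariantised_entropy(p, ref):
    """Entropy relative to a reference density, -integral of p log(p / ref)."""
    return -midpoint(lambda z: p(z) * math.log(p(z) / ref(z)))

h_phi, h_psi = entropy(p_phi), entropy(p_psi)
inv_phi = invariantised_entropy(p_phi, pi_phi)
inv_psi = invariantised_entropy(p_psi, pi_psi)
print(h_phi, h_psi)      # differ between the two parametrisations
print(inv_phi, inv_psi)  # agree up to quadrature error
```

The plain entropy changes from 1/2 - log 2 to 0 under the reparametrisation, whereas the version relative to the (transformed) reference density is unchanged. The sketch takes the reference density as given, which is exactly the step the text identifies as unresolved.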

Context-specific “non-informative” Bayesian analyses have been produced for 
specific classes of problems, with no attempt to provide a general theory. These in- 
clude dynamic models (Pole and West, 1989) and finite population survey sampling 
(Meeden and Vardeman, 1991). 




The quest for non-informative priors could be summarised as follows. 

(i) In the finite case, Jaynes’ principle of maximising the entropy is convincing, 
but cannot be extended to the continuous case. 

(ii) In one-dimensional continuous regular problems, Jeffreys’ prior is appropriate. 


(iii) The infinite discrete case can often be handled by suitably embedding the 
problem within a continuous framework. 


(iv) In continuous multiparameter situations there is no hope for a single, unique, 
“non-informative prior”, appropriate for all the inference problems within a 
given model. To avoid having the prior dominating the posterior for some 
function $\phi$ of interest, the prior has to depend not only on the model but also
on the parameter of interest or, more generally, on some notion of the order
of importance of the parameters. 


The reference prior theory introduced in Bernardo (1979b) and developed in 
detail in Section 5.4 avoids most of the problems encountered with other proposals. 
It reduces to Jaynes’ form in the finite case and to Jeffreys’ form in one-dimensional 
regular continuous problems, avoiding marginalisation paradoxes by insisting that 
the reference prior be tailored to the parameter of interest. However, subsequent 
work by Berger and Bernardo (1989) has shown that the heuristic arguments in 
Bernardo (1979b) can be misleading in complicated situations, thus necessitating 
more precise definitions. Moreover, Berger and Bernardo (1992a, 1992b, 1992c)
showed that the partition into parameters of interest and nuisance parameters may
not go far enough and that reference priors should be viewed relative to a given 
ordering —or, more generally, a given ordered grouping —of the parameters. This 
approach was described in detail in Section 5.4. Ye (1993) derives reference priors 
for sequential experiments. 

A completely different objection to such approaches to non-informative priors 
lies in the fact that, for continuous parameters, they depend on the likelihood func- 
tion. This is recognised to be potentially inconsistent with a personal interpretation 
of probability. For many subjectivists, the initial density $p(\phi)$ is a description of
the opinions held about $\phi$, independent of the experiment performed;


why should one’s knowledge, or ignorance, of a quantity depend on the experi- 
ment being used to determine it? Lindley (1972, p. 71). 


In many situations, we would accept this argument. However, as we argued 
earlier, priors which reflect knowledge of the experiment can sometimes be gen- 
uinely appropriate in Bayesian inference, and may also have a useful role to play 
(see, for example, the discussion of stopping rules in Section 5.1.4) as technical 
devices to produce reference posteriors. Posteriors obtained from actual prior opin- 
ions could then be compared with those derived from a reference analysis in order 
to assess the relative importance of the initial opinions on the final inference. 


In general we feel that it is sensible to choose a non-informative prior which 
expresses ignorance relative to information which can be supplied by a particular 
experiment. If the experiment is changed, then the expression of relative ignorance
can be expected to change correspondingly. (Box and Tiao, 1973, p. 46).


Finally, “non-informative” distributions have sometimes been criticised on the 
grounds that they are typically improper and may lead, for instance, to inadmissible
estimates (see, e.g., Stein, 1956). However, sensible “non-informative” priors
may be seen to be, in an appropriate sense, limits of proper priors (Stone, 1963, 
1965, 1970; Stein, 1965; Akaike, 1980a). Regarded as a “baseline” for admissible 
inferences, posterior distributions derived from “non-informative” priors need not 
be themselves admissible, but only arbitrarily close to admissible posteriors. 

However, there can be no final word on this topic! For example, recent work 
by Eaton (1992), Clarke and Wasserman (1993), George and McCulloch (1993b) 
and Ye (1993) seems to open up new perspectives and directions. 


5.6.3 Robustness 


In Section 4.8.3, we noted that some aspects of model specification, either for the 
parametric model or the prior distribution components, can seem arbitrary, and cited 
as an example the case of the choice between normal and Student-t distributions
as a parametric model component to represent departures of observables from their
conditional expected values. In this section, we shall provide some discussion of
how insight and guidance into appropriate choices might be obtained. 

We begin our discussion with a simple, direct approach to examining the ways 
in which a posterior density for a parameter depends on the choices of parametric 
model or prior distribution components. Consider, for simplicity, a single observable
$x \in \Re$ having a parametric density $p(x \mid \theta)$, with $\theta \in \Re$ having prior density
$p(\theta)$. The mechanism of Bayes’ theorem,

\[ p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{\int p(x \mid \theta)\,p(\theta)\,d\theta}, \]

involves multiplication of the two model components, $p(x \mid \theta)$ and $p(\theta)$, followed by
normalisation, a somewhat “opaque” operation from the point of view of comparing
specifications of $p(x \mid \theta)$ or $p(\theta)$ on a “what if?” basis.

However, suppose we take logarithms in Bayes’ theorem and subsequently 
differentiate with respect to $\theta$. This now results in a linear form

\[ \frac{\partial}{\partial\theta}\log p(\theta \mid x)
   = \frac{\partial}{\partial\theta}\log p(x \mid \theta) + \frac{\partial}{\partial\theta}\log p(\theta). \]

The first term on the right-hand side is (apart from a sign change) a quantity known
in classical statistics as the efficient score function (see, for example, Cox and
Hinkley, 1974). On the linear scale, this is the quantity which transforms the prior
into the posterior and hence opens the way, perhaps, to insight into the effect of a
particular choice of $p(x \mid \theta)$ given the form of $p(\theta)$. See, for example, Ramsey and
Novick (1980) and Smith (1983). Conversely, examination of the second term on
the right-hand side for given $p(x \mid \theta)$ may provide insight into the implications of
the mathematical specification of the prior. 
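
The additive form obtained by differentiating the logarithm of Bayes' theorem can be verified numerically in a case where the posterior is known in closed form. The following is a minimal sketch under assumed values: a single normal observation with unit precision, a normal prior with mean 0 and unit precision, and central finite differences for the derivatives. The helper names are hypothetical.

```python
import math

def log_normal(y, mean, precision):
    """Log density of a normal distribution parametrised by mean and precision."""
    return 0.5 * math.log(precision / (2.0 * math.pi)) - 0.5 * precision * (y - mean) ** 2

def d_dtheta(f, theta, h=1e-5):
    """Central finite-difference derivative of f at theta."""
    return (f(theta + h) - f(theta - h)) / (2.0 * h)

x, mu0, lam, lam0 = 1.7, 0.0, 1.0, 1.0   # illustrative values only

log_lik = lambda th: log_normal(x, th, lam)       # log p(x | theta)
log_prior = lambda th: log_normal(th, mu0, lam0)  # log p(theta)

# Conjugate posterior: precision lam + lam0, mean the precision-weighted average.
post_prec = lam + lam0
post_mean = (lam * x + lam0 * mu0) / post_prec
log_post = lambda th: log_normal(th, post_mean, post_prec)

theta = 0.4
lhs = d_dtheta(log_post, theta)
rhs = d_dtheta(log_lik, theta) + d_dtheta(log_prior, theta)
print(lhs, rhs)
```

The normalising constant drops out on differentiation, which is why the two sides agree even though the posterior was normalised separately.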

For convenience of exposition —and perhaps because the prior component is 
often felt to be the less secure element in the model specification —we shall focus 
the following discussion on the sensitivity of characteristics of $p(\theta \mid x)$ to the choice
of $p(\theta)$. Similar ideas apply to the choice of $p(x \mid \theta)$.

With $\bar{x}$ denoting the mean of $n$ independent observables from a normal distribution
with mean $\theta$ and precision $\lambda$, we shall illustrate these ideas by considering
the form of the posterior mean for $\theta$ when $p(\bar{x} \mid \theta) = N(\bar{x} \mid \theta, n\lambda)$ and $p(\theta)$ is of
“arbitrary” form.

Defining

\[ p(\bar{x}) = \int p(\bar{x} \mid \theta)\,p(\theta)\,d\theta, \qquad
   s(\bar{x}) = -\frac{\partial \log p(\bar{x})}{\partial \bar{x}}, \]

it can be shown (see, for example, Pericchi and Smith, 1992) that

\[ E(\theta \mid \bar{x}) = \bar{x} - n^{-1}\lambda^{-1}s(\bar{x}). \]


Suppose we carry out a “what if?” analysis by asking how the behaviour of 
the posterior mean depends on the mathematical form adopted for $p(\theta)$.

What if we take $p(\theta)$ to be normal? With $p(\theta) = N(\theta \mid \mu, \lambda_0)$, the reader can
easily verify that in this case $p(\bar{x})$ will be normal, and hence $s(\bar{x})$ will be a linear
combination of $\bar{x}$ and the prior mean. The formula given for $E(\theta \mid \bar{x})$ therefore
reproduces the weighted average of sample and prior means that we obtained in
Section 5.2, so that

\[ E(\theta \mid \bar{x}) = (n\lambda + \lambda_0)^{-1}(n\lambda\bar{x} + \lambda_0\mu). \]

What if we take $p(\theta)$ to be Student-t? With $p(\theta) = St(\theta \mid \mu, \lambda_0, \alpha)$ the exact
treatment of $p(\bar{x})$ and $s(\bar{x})$ becomes intractable. However, detailed analysis
(Pericchi and Smith, 1992) provides the approximation

\[ E(\theta \mid \bar{x}) \approx \bar{x} - \frac{(\alpha + 1)(\bar{x} - \mu)}{n\lambda\left[\alpha\lambda_0^{-1} + (\bar{x} - \mu)^2\right]}. \]

What if we take $p(\theta)$ to be double-exponential? In this case,


for some $\nu > 0$, $\mu \in \Re$, and the evaluation of $p(\bar{x})$ and $s(\bar{x})$ is possible, but
tedious. After some algebra (see Pericchi and Smith, 1992) it can be shown
that, if $b = n^{-1}\nu^{-1}\sqrt{2}$,

\[ E(\theta \mid \bar{x}) = w(\bar{x})(\bar{x} + b) + [1 - w(\bar{x})](\bar{x} - b), \]

where $w(\bar{x})$ is a weight function, $0 \le w(\bar{x}) \le 1$, so that

\[ \bar{x} - b \le E(\theta \mid \bar{x}) \le \bar{x} + b. \]

Examination of the three forms for $E(\theta \mid \bar{x})$ reveals striking qualitative differences.
In the case of the normal, the posterior mean is unbounded in $\bar{x} - \mu$, the
departure of the observed mean from the prior mean. In the case of the Student-t,
we see that for very small $\bar{x} - \mu$ the posterior mean is approximately linear in $\bar{x} - \mu$,
like the normal, whereas for $\bar{x} - \mu$ very large the posterior mean approaches $\bar{x}$.
In the case of the double-exponential, the posterior mean is bounded, with limits
equal to $\bar{x}$ plus or minus a constant.
Consideration of these qualitative differences might provide guidance regarding
an otherwise arbitrary choice if, for example, one knew how one would like the
Bayesian learning mechanism to react to an “outlying” $\bar{x}$, which was far from $\mu$.
See Smith (1983) and Pericchi et al. (1993) for further discussion and elaboration.
See Jeffreys (1939/1961) for seminal ideas relating to the effect of the tail-weight
of the distribution of the parametric model on posterior inferences. Other relevant
references include Masreliez (1975), O’Hagan (1979, 1981, 1988b), West (1981),
Main (1988), Polson (1991), Gordon and Smith (1993) and O’Hagan and Le (1994).
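
The qualitative behaviour described above for normal, Student-t and double-exponential priors can be explored numerically. The sketch below is illustrative rather than a reproduction of the cited analyses: it computes the posterior mean by a simple grid sum, and the hyperparameter values (prior location 0, prior precision 1, 3 degrees of freedom, scale 1, with n = 1 and unit sampling precision) are assumptions. The double-exponential log density is taken, as a further assumption, to be proportional to -sqrt(2)|theta - mu|/nu.

```python
import math

def posterior_mean(xbar, log_prior, n=1, lam=1.0, half_width=60.0, steps=40001):
    """E(theta | xbar) for a normal likelihood with precision n*lam,
    approximated by a Riemann sum on a uniform grid centred at xbar."""
    h = 2.0 * half_width / (steps - 1)
    thetas = [xbar - half_width + i * h for i in range(steps)]
    log_post = [-0.5 * n * lam * (th - xbar) ** 2 + log_prior(th) for th in thetas]
    top = max(log_post)  # subtract the maximum to avoid overflow in exp
    weights = [math.exp(v - top) for v in log_post]
    return sum(th * w for th, w in zip(thetas, weights)) / sum(weights)

MU, LAM0, ALPHA, NU = 0.0, 1.0, 3.0, 1.0  # illustrative hyperparameters

def log_normal_prior(th):
    return -0.5 * LAM0 * (th - MU) ** 2

def log_student_prior(th):
    return -0.5 * (ALPHA + 1.0) * math.log(1.0 + LAM0 * (th - MU) ** 2 / ALPHA)

def log_double_exp_prior(th):
    return -math.sqrt(2.0) * abs(th - MU) / NU

priors = (log_normal_prior, log_student_prior, log_double_exp_prior)
for xbar in (1.0, 5.0, 20.0, 50.0):
    print(xbar, [round(posterior_mean(xbar, lp), 3) for lp in priors])
```

As the observed mean moves away from the prior location, the normal prior keeps pulling the posterior mean halfway back, the Student-t prior progressively lets go, and the double-exponential prior shifts the posterior mean by an amount that never exceeds a fixed constant.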

The approach illustrated above is well-suited to probing qualitative differ- 
ences in the posterior by considering, individually, the effects of a small number 
of potential alternative choices of model component (parametric model or prior 
distribution). 

Suppose, instead, that someone has in mind a specific candidate component 
specification, $p_0$, say, but is all too aware that aspects of the specification have
involved somewhat arbitrary choices. It is then natural to be concerned about
whether posterior conclusions might be highly sensitive to the particular specification
$p_0$, viewed in the context of alternative choices in an appropriately defined
neighbourhood of $p_0$.

In the case of specifying a parametric component $p_0$, for example an “error”
model for differences between observables and their (conditional) expected
values, such concern might be motivated by definite knowledge of symmetry and
unimodality, but an awareness of the arbitrariness of choosing a conventional
distributional form such as normality. Here, a suitable neighbourhood might be formed
by taking $p_0$ to be normal and forming a class of distributions whose tail-weights
deviate (lighter and heavier) from normal: see, for example, the seminal papers of
Box and Tiao (1962, 1964).

In the case of specifying a prior component $p_0$, such concern might be motivated
by the fact that elicitation of prior opinion has only partly determined the
specification (for example, by identifying a few quantiles), with considerable
remaining arbitrariness in “filling out” the rest of the distribution. Here, a suitable
neighbourhood of $p_0$ might consist of a class of priors all having the specified
quantiles but with other characteristics varying: see, for example, O’Hagan and Berger
(1988).

From a mathematical perspective, this formulation of the robustness problem 
presents some intriguing challenges. How to formulate interesting neighbourhood 
classes of distributions? How to calculate, with respect to such prior classes, bounds
on posterior quantities of interest such as expectations or probabilities? 

At the time of writing, this is an area of intensive research. For example,
should neighbourhoods be defined parametrically or non-parametrically? And, if
non-parametrically, what measures of distance should be used to define a neighbourhood
“close” to $p_0$? Should the elements, $p$, of the neighbourhood be those such
that the density ratio $p/p_0$ is bounded in some sense? Or such that the maximum
difference in the probability assigned to any event under $p$ and $p_0$ is bounded? Or
such that $p$ can be written as a “contamination” of $p_0$, $p = (1 - \varepsilon)p_0 + \varepsilon q$, for small
$\varepsilon$ and $q$ belonging to a suitable class?
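
As a concrete, deliberately simple instance of the “contamination” formulation, the sketch below scans a family of contaminating distributions q and records the range of the resulting posterior mean. All specifics are assumptions made for illustration: one observation x from a normal model with unit variance, a base prior that is normal with mean 0 and variance 1, and contaminations restricted to point masses on a finite grid. Other classes of q would generally give different bounds.

```python
import math

def normal_pdf(y, mean, var):
    """Normal density parametrised by mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_mean_range(x, eps, grid):
    """Range of E(theta | x) over priors (1 - eps) * p0 + eps * q, with p0 = N(0, 1)
    and q a point mass at c, for c on the supplied grid."""
    m0 = normal_pdf(x, 0.0, 2.0)   # marginal density of x under the base prior
    e0 = x / 2.0                   # posterior mean under the base prior
    means = []
    for c in grid:
        fc = normal_pdf(x, c, 1.0)  # marginal density of x under a point mass at c
        means.append(((1.0 - eps) * m0 * e0 + eps * fc * c)
                     / ((1.0 - eps) * m0 + eps * fc))
    return min(means), max(means)

grid = [i / 100.0 for i in range(-1000, 1001)]
lo, hi = posterior_mean_range(x=2.0, eps=0.1, grid=grid)
print(lo, hi)
```

The posterior under the mixture prior is itself a mixture of the two component posteriors, weighted by the prior weights times the corresponding marginal densities of x, which is what the ratio in the loop computes. The width of the interval from `lo` to `hi` is one crude measure of sensitivity to the choice of prior within this neighbourhood.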

As yet, few issues seem to be resolved and we shall not, therefore, attempt a detailed
overview. Relevant references include: Edwards et al. (1963), Dawid (1973),
Dempster (1975), Hill (1975), Meeden and Isaacson (1977), Rubin (1977, 1988a,
1988b), Kadane and Chuang (1978), Berger (1980, 1982, 1985a), DeRobertis and
Hartigan (1981), Hartigan (1983), Kadane (1984), Berger and Berliner (1986),
Kempthorne (1986), Berger and O’Hagan (1988), Cuevas and Sanz (1988), Pericchi
and Nazaret (1988), Polasek and Pötzelberger (1988, 1994), Carlin and Dempster
(1989), Delampady (1989), Sivaganesan and Berger (1989, 1993), Wasserman
(1989, 1992a, 1992b), Berliner and Goel (1990), Delampady and Berger (1990),
Doksum and Lo (1990), Wasserman and Kadane (1990, 1992a, 1992b), Rios (1990,
1992), Angers and Berger (1991), Berger and Fan (1991), Berger and Mortera
(1991b, 1994), Lavine (1991a, 1991b, 1992a, 1992b, 1994), Lavine et al. (1991,
1993), Moreno and Cano (1991), Pericchi and Walley (1991), Pötzelberger and
Polasek (1991), Sivaganesan (1991), Walley (1991), Berger and Jefferys (1992), Gilio
(1992b), Gómez-Villegas and Main (1992), Moreno and Pericchi (1992, 1993),
Nau (1992), Sansó and Pericchi (1992), Liseo et al. (1993), Osiewalski and Steel
(1993), Bayarri and Berger (1994), de la Horra and Fernandez (1994), Delampady 
and Dey (1994), O’Hagan (1994b), Pericchi and Pérez (1994), Rios and Martin 
(1994), Salinetti (1994). There are excellent reviews by Berger (1984a, 1985a, 
1990, 1994) and Wasserman (1992a), which together provide a wealth of further 
references. 

Finally, in the case of a large data sample, one might wonder whether the data 
themselves could be used to suggest a suitable form of parametric model component, 
thus removing the need for detailed specification and hence the arbitrariness of the 
choice. The so-called Bayesian bootstrap provides such a possible approach; see, 
for instance, Rubin (1981) and Lo (1987, 1993). However, since it is a heavily 
computationally based method we shall defer discussion to the volume Bayesian 
Computation. 


The term Bootstrap is more familiar to most statisticians as a computationally 
intensive frequentist data-based simulation method for statistical inference; in 
particular, as a computer-based method for assigning frequentist measures of 
accuracy to point estimates. For an introduction to the method—and to the related 
technique of jackknifing —see Efron (1982). For a recent textbook treatment, see 
Efron and Tibshirani (1993). See, also, Hartigan (1969, 1975). 


5.6.4 Hierarchical and Empirical Bayes 


In Section 4.6.5, we motivated and discussed model structures which take the form 
of an hierarchy. Expressed in terms of generic densities, a simple version of such 
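an hierarchical model has the form

\[ p(x \mid \theta) = p(x_1, \ldots, x_k \mid \theta_1, \ldots, \theta_k) = \prod_{i=1}^{k} p(x_i \mid \theta_i), \]
\[ p(\theta \mid \phi) = p(\theta_1, \ldots, \theta_k \mid \phi) = \prod_{i=1}^{k} p(\theta_i \mid \phi), \]
\[ p(\phi). \]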


The basic interpretation is as follows. Observables $x_1, \ldots, x_k$ are available
from $k$ different, but related, sources: for example, $k$ individuals in a homogeneous
population, or $k$ clinical trial centres involved in the same study. The first stage of
the hierarchy specifies parametric model components for each of the $k$ observables.
But because of the “relatedness” of the $k$ observables, the parameters $\theta_1, \ldots, \theta_k$
are themselves judged to be exchangeable. The second and third stages of the
hierarchy thus provide a prior for $\theta$ of the familiar mixture representation form

\[ p(\theta) = p(\theta_1, \ldots, \theta_k) = \int \prod_{i=1}^{k} p(\theta_i \mid \phi)\,p(\phi)\,d\phi. \]

Here, the “hyperparameter” $\phi$ typically has an interpretation in terms of
characteristics (for example, mean and covariance) of the population (of individuals,
trial centres) from which the $k$ units are drawn.

In many applications, it may be of interest to make inferences both about the 
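unit characteristics, the $\theta_i$’s, and the population characteristics, $\phi$. In either case,
straightforward probability manipulations involving Bayes’ theorem provide the
required posterior inferences as follows:

\[ p(\theta_i \mid x) = \int p(\theta_i \mid \phi, x)\,p(\phi \mid x)\,d\phi, \]

where

\[ p(\theta_i \mid \phi, x) \propto p(x \mid \theta_i)\,p(\theta_i \mid \phi), \]
\[ p(\phi \mid x) \propto p(x \mid \phi)\,p(\phi), \]

and

\[ p(x \mid \phi) = \int p(x \mid \theta)\,p(\theta \mid \phi)\,d\theta. \]
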
Of course, actual implementation requires the evaluation of the appropriate 
integrals and this may be non-trivial in many cases. However, as we shall see in 


the volumes Bayesian Computation and Bayesian Methods, such models can be 
implemented in a fully Bayesian way using appropriate computational techniques. 
A detailed analysis of hierarchical models will be provided in those volumes; some
key references are Good (1965, 1980b), Ericson (1969a, 1969b), Hill (1969, 1974), 
Lindley (1971), Lindley and Smith (1972), Smith (1973a, 1973b), Goldstein and 
Smith (1974), Leonard (1975), Mouchart and Simar (1980), Goel and DeGroot 
(1981), Goel (1983), Dawid (1988b), Berger and Robert (1990), Pérez and Pericchi 
(1992), Schervish et al. (1992), van der Merwe and van der Merwe (1992), Wolpert 
and Warren-Hicks (1992) and George et al. (1993, 1994). 

A tempting approximation is suggested by the first line of the analysis above. 
We note that if $p(\phi \mid x)$ were fairly sharply peaked around its mode, $\phi^*$, say, we
would have

\[ p(\theta_i \mid x) \approx p(\theta_i \mid \phi^*, x). \]

The form that results can be thought of as if we first use the data to estimate $\phi$ and
then, with $\phi^*$ as a “plug-in” value, use Bayes’ theorem for the first two stages of
the hierarchy. The analysis thus has the flavour of a Bayesian analysis, but with an 
“empirical” prior based on the data. 

Such short-cut approximations to a fully Bayesian analysis of hierarchical 
models have become known as Empirical Bayes methods. This is actually slightly 
confusing, since the term was originally used to describe frequentist estimation of 
the second-stage distribution: see Robbins (1955, 1964, 1983). However, more 
recently, following the line of development of Efron and Morris (1972, 1975) and 
Morris (1983), the term has come to refer mainly to work aimed at approximating 
(aspects of) posterior distributions arising from hierarchical models. 

The naive approximation outlined above is clearly deficient in that it ignores
uncertainty in $\phi$. Much of the development following Morris (1983) has
been directed to finding more defensible approximations. For more whole-hearted
Bayesian approaches, see Deely and Lindley (1981), Gilliland et al. (1982), Kass
and Steffey (1989) and Ghosh (1992a). An eclectic account of empirical Bayes 
methods is given by Maritz and Lwin (1989). 
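
The effect of ignoring uncertainty in the hyperparameter can be made explicit in a conjugate special case. The sketch below rests on assumptions that are not part of the text: each observable is normal about its own unit parameter with variance 1, the unit parameters are normal about a common mean with known variance `tau2`, and the prior for the common mean is flat. Under those assumptions the posterior mode of the common mean is the sample average, and both the plug-in and the fully Bayesian posterior variances of a unit parameter are available in closed form.

```python
def shrinkage_summary(x, tau2):
    """Normal hierarchy with unit sampling variance, between-unit variance tau2
    and a flat prior on the population mean phi. Returns the posterior means of
    the theta_i together with the plug-in and fully Bayesian posterior variances."""
    k = len(x)
    xbar = sum(x) / k                  # mode (and mean) of p(phi | x)
    shrink = 1.0 / (1.0 + tau2)        # weight given to the population mean
    means = [shrink * xbar + (1.0 - shrink) * xi for xi in x]
    var_plug_in = 1.0 - shrink         # Var(theta_i | phi*, x)
    # Law of total variance: add Var over p(phi | x) = N(xbar, (1 + tau2) / k)
    # of the conditional mean shrink * phi + (1 - shrink) * x_i.
    var_full = var_plug_in + shrink ** 2 * (1.0 + tau2) / k
    return means, var_plug_in, var_full

x = [2.1, -0.3, 1.4, 0.8, 3.0]         # made-up data for illustration
means, v_eb, v_fb = shrinkage_summary(x, tau2=0.5)
print(means)
print(v_eb, v_fb)
```

In this setting the two analyses give the same point estimates, but the plug-in variance is too small by `shrink / k`, a gap that matters most when the number of units is small.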


5.6.5 Further Methodological Developments 


The distinction between theory and methods is not always clear-cut and the extensive 
Bayesian literature on specific methodological topics obviously includes a wealth 
of material relating to Bayesian concepts and theory. We shall review this material 
in the volume Bayesian Methods and confine ourselves here to simply providing a 
few references. 

Among the areas which have stimulated the development of Bayesian theory, 
we note the following: Actuarial Science and Insurance (Jewell, 1974, 1988;
Singpurwalla and Wilson, 1992), Calibration (Dunsmore, 1968; Hoadley, 1970;
Brown and Mäkeläinen, 1992), Classification and Discrimination (Geisser, 1964,
1966; Binder, 1978; Bernardo, 1988, 1994; Bernardo and Girón, 1989; Dawid
and Fang, 1992), Contingency Tables (Lindley, 1964; Good, 1965, 1967; Leonard,
1975; Leonard and Hsu, 1994), Control Theory (Aoki, 1967; Sawagari et al., 1967),
Econometrics (Mills, 1992; Steel, 1992), Finite Population Sampling (Basu, 1969,
1971; Ericson, 1969b, 1988; Godambe, 1969, 1970; Smouse, 1984; Lo, 1986),
Image Analysis (Geman and Geman, 1984; Besag, 1986, 1989; Geman, 1988;
Mardia et al., 1992; Grenander and Miller, 1994), Law (Dawid, 1994), Meta-Analysis
(DuMouchel and Harris, 1992; Wolpert and Warren-Hicks, 1992), Missing
Data (Little and Rubin, 1987; Rubin, 1987; Meng and Rubin, 1992), Mixtures
(Titterington et al., 1985; Berliner, 1987; Bernardo and Girón, 1988; Florens et
al., 1992; West, 1992b; Diebolt and Robert, 1994; Robert and Soubiran, 1993;
West et al., 1994), Multivariate Analysis (Brown et al., 1994), Quality Assurance
(Wetherill and Campling, 1966; Hald, 1968; Booth and Smith, 1976; Irony et al.,
1992; Singpurwalla and Soyer, 1992), Splines (Wahba, 1978, 1983, 1988; Gu,
1992; Ansley et al., 1993; Cox, 1993), Stochastic Approximation (Makov, 1988)
and Time Series and Forecasting (Meinhold and Singpurwalla, 1983; West and
Migon, 1985; Mortera, 1986; Smith and Gathercole, 1986; West and Harrison,
1986, 1989; Harrison and West, 1987; Ameen, 1992; Carlin and Polson, 1992;
Gamerman, 1992; Smith, 1992; Gamerman and Migon, 1993; McCulloch and
Tsay, 1993; Pole et al., 1994).


5.6.6 Critical Issues 


We conclude this chapter on inference by briefly discussing some further issues 
under the headings: (i) Model Conditioned Inference, (ii) Prior Elicitation, (iii)
Sequential Methods and (iv) Comparative Inference. 


Model Conditioned Inference 


We have remarked on several occasions that the Bayesian learning process is pred- 
icated on a more or less formal framework. In this chapter, this has translated into 
model conditioned inference, in the sense that all prior to posterior or predictive 
inferences have taken place within the closed world of an assumed model structure. 

It has therefore to be frankly acknowledged and recognised that all such
inference is conditional. If we accept the model, then the mechanics of Bayesian
learning, derived ultimately from the requirements of quantitative coherence,
provide the appropriate uncertainty accounting and dynamics.

But what if, as individuals, we acknowledge some insecurity about the model? 
Or need to communicate with other individuals whose own models differ? 

Clearly, issues of model criticism, model comparison, and, ultimately, model
choice, are as much a part of the general world of confronting uncertainty as
model conditioned thinking. We shall therefore devote Chapter 6 to a systematic
exploration of these issues.


Prior Elicitation


We have emphasised, over and over, that our interpretation of a model requires —in 
conventional parametric representations — both a likelihood and a prior. 

In accounts of Bayesian Statistics from a theoretical perspective — like that of 
this volume—discussions of the prior component inevitably focus on stylised forms, 
such as conjugate or reference specifications, which are amenable to a mathematical 
treatment, thus enabling general results and insights to be developed. 

However, there is a danger of losing sight of the fact that, in real applications, 
prior specifications should be encapsulations of actual beliefs rather than stylised 
forms. This, of course, leads to the problem of how to elicit and encode such beliefs, 
i.e., how to structure questions to an individual, and how to process the answers, in 
order to arrive at a formal representation. 

Much has been written on this topic, which clearly goes beyond the boundaries
of statistical formalism and has proved of interest and importance to researchers
from a number of other disciplines, including psychology and economics. However,
despite its importance, the topic has a focus and flavour substantially different from
the main technical concerns of this volume, and will be better discussed in the
volume Bayesian Methods.

We shall therefore not attempt here any kind of systematic review of the very
extensive literature. Very briefly, from the perspective of applications, the best
known protocol seems to be that described by Staël von Holstein and Matheson
(1979), the use of which in a large number of case studies has been reviewed by
Merkhofer (1987). General discussion in a text-book setting is provided, for
example, by Morgan and Henrion (1990), and Goodwin and Wright (1991). Warnings
about the problems and difficulties are given in Kahneman et al. (1982). Some key
references are de Finetti (1967), Winkler (1967a, 1967b), Edwards et al. (1968),
Hogarth (1975, 1980), Dickey (1980), French (1980), Kadane (1980), Lindley
(1982d), Jaynes (1985), Garthwaite and Dickey (1992), Leonard and Hsu (1992)
and West and Crosse (1992).


Sequential Methods 


In Section 2.6 we gave a brief overview of sequential decision problems but for most 
of our developments, we assumed that data were treated globally. It is obvious, 
however, that data are often available in sequential form and, moreover, there are 
often computational advantages in processing data sequentially, even if they are all 
immediately available. 

There is a large Bayesian literature on sequential analysis and on sequential
computation, which we will review in the volumes Bayesian Computation
and Bayesian Methods. Key references include the seminal monograph of Wald
(1947), Jackson (1960), who provides a bibliography of early work, Wetherill
(1961), and the classic texts of Wetherill (1966) and DeGroot (1970). Berger and
Berry (1988) discuss the relevance of stopping rules in statistical inference. Some
other references, primarily dealing with the analysis of stopping rules, are Amster
(1963), Barnard (1967), Bartholomew (1967), Roberts (1967), Basu (1975) and
Irony (1993). Witmer (1986) reviews multistage decision problems.


Comparative Inference 


In this and in other chapters, our main concern has been to provide a self-contained
systematic development of Bayesian ideas. However, both for completeness, and
for the very obvious reason that there are still some statisticians who do not currently
subscribe to the position adopted here, it seems necessary to make some attempt to
compare and contrast Bayesian and non-Bayesian approaches.

We shall therefore provide, in Appendix B, a condensed critical overview of
mainstream non-Bayesian ideas and developments. Any reader for whom our
treatment is too condensed should consult Thatcher (1964), Pratt (1965), Bartholomew
(1971), Press (1972/1982), Barnett (1973/1982), Cox and Hinkley (1974), Box
(1983), Anderson (1984), Casella and Berger (1987, 1990), DeGroot (1987),
Piccinato (1992) and Poirier (1993).



Chapter 6


Remodelling 


Summary 


It is argued that, whether viewed from the perspective of a sensitive individual
modeller or from that of a group of modellers, there are good reasons for
systematically entertaining a range of possible belief models. A variety of decision
problems are examined within this framework: some involving model choice
only; some involving model choice followed by a terminal action, such as
prediction; others involving only a terminal action. Throughout, a clear distinction is
drawn between three rather different perspectives: first, the case where the range
of models under consideration is assumed to include the “true” belief model;
secondly, the case where the range of models is being considered in order to
provide a proxy for a specified, but intractable, actual belief model; finally, the case
where the range of models is being considered in the absence of specification of
an actual belief model. Links with hypothesis testing, significance testing and
cross-validation are established.


6.1 MODEL COMPARISON 
6.1.1 Ranges of Models 


We recall from Chapter 4 that our ultimate modelling concern is with predictive
beliefs for sequences of observables. More specifically, most of our detailed
development has centred on belief models corresponding to judgements of exchangeability
or, more generally, various forms of partial exchangeability.

In such cases, the predictive model typically has a mixture representation in
terms of a random sample from a labelled model, together with a prior distribution
for the label, the latter being interpretable in terms of a strong law limit of
observables. For example, we saw that, for an exchangeable real-valued sequence,
a predictive belief distribution, $P$, has the general representation

\[
P(x_1, \ldots, x_n) = \int_{\mathcal{F}} \prod_{i=1}^{n} F(x_i) \, dQ(F).
\]

This corresponds to an (as if) assumption of a random sample from the unknown
distribution function, $F$, together with a prior distribution, $Q$, for $F$, defined over
the space, $\mathcal{F}$, of all distribution functions on $\mathbb{R}$.

However, the very general nature of this representation precludes it —at least in 
terms of current limitations on intuition and technique — from providing a practical 
basis for routine concrete applications. This is why, in Chapter 4, much of our 
subsequent development was based on formal assumptions of further invariance 
or sufficiency structure, or pragmatic appeal to historical experience or scientific 
authority, in order to replace the general representation by mixtures involving finite- 
parameter families of densities. 

Inescapably, however, this passage from the general, but intractable, form to
a specific, but tractable, model involves judgements and assumptions going far
beyond the simple initial judgement of exchangeability. These further judgements,
and hence the models that result from them, are therefore typically much less
securely based in terms of individual beliefs, and certainly much less likely to be
mutually acceptable in an interpersonal context, than the straightforward symmetry
judgement. Both from the perspective of a sensitive individual modeller and
also from that of a group of modellers, there are therefore strong reasons for
systematically entertaining a range of possible models (see, for example, Dickey, 1973,
and Smith, 1986).

Given the assumption of exchangeability, a range of different belief models,
$P_1, P_2, \ldots$, can each be represented in the general form

\[
P_j(x_1, \ldots, x_n) = \int_{\mathcal{F}} \prod_{i=1}^{n} F(x_i) \, dQ_j(F),
\]

for some $Q_1, Q_2, \ldots$, the latter encapsulating the particular alternative judgements
that characterise the different models. The following stylised examples serve to
illustrate some of the kinds of ranges of models that might be entertained in
applications involving simple exchangeability judgements. In each case, the range of
models can either be thought of as generated by a single, non-dogmatic individual
(seeking to avoid commitment to one specific form); or generated as concrete
suggestions by a group of individuals (each committed to one of the forms); or
generated purely formally, as an imaginative proxy for models thought likely to
correspond to the ranges of judgements which might be made by the eventual
readership of inference reports based on the models. In general, our subsequent
development will be expressed in terms of a possibly infinite sequence of models
$P_1, P_2, \ldots$; in practice, we typically only work with a finite range, $P_1, \ldots, P_k$, for
some $k \geq 2$.


Inference for a Location Parameter 
Suppose that observations $x_1, \ldots, x_n, \ldots$ can be thought of, conditional on $\mu$,
as measurements of $\mu$ with errors $\epsilon_1, \ldots, \epsilon_n, \ldots$, so that

\[
x_i = \mu + \epsilon_i, \quad i = 1, \ldots, n,
\]

with $\epsilon_1, \epsilon_2, \ldots$ exchangeable. Various beliefs are then possible about the “error”
distribution. For example, appeal to the central limit theorem (Section 3.2.3) might
suggest the assumption of normality; however, past experience might suggest a
substantial proportion of “aberrant” or “outlying” measurements, thus requiring
a distribution with heavier tails than normality; different past experience might
suggest that the experimenter automatically suppresses any observations suspected
of being “aberrant”, thus requiring the assumption of a distribution with lighter tails
than normality. With $k = 3$, and using density representations throughout, a choice
of a range of models to cover these possibilities might be:


\[
p_j(x_1, \ldots, x_n) = \int\!\!\int \prod_{i=1}^{n} p_j(x_i \mid \mu, \sigma) \, p_j(\mu, \sigma) \, d\mu \, d\sigma,
\]

where

\[
p_1(x \mid \mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}, \quad x \in \mathbb{R},
\]

\[
p_2(x \mid \mu, \sigma) = \frac{1}{\sqrt{2}\,\sigma} \exp\left\{ -\frac{\sqrt{2}}{\sigma} |x - \mu| \right\}, \quad x \in \mathbb{R},
\]

and

\[
p_3(x \mid \mu, \sigma) = \frac{1}{2\sqrt{3}\,\sigma}, \quad x \in (\mu - \sqrt{3}\sigma, \mu + \sqrt{3}\sigma),
\]

with $p_j(\mu, \sigma)$, $j = 1, 2, 3$, specifying prior beliefs for the location and scale
parameters appearing in these normal, double-exponential and uniform parametric
models. Thus $p_j(\mu, \sigma) = dQ_j(F)$ corresponds to a belief over $\mathcal{F}$ which assigns
probability one to the family with parametric form $p_j(\cdot \mid \mu, \sigma)$, with density $p_j(\mu, \sigma)$
for the two parameters of this family. If these modelling possibilities emanate from a
single individual, $p_j(\mu, \sigma)$ might not depend on $j$; in general, however, the $p_j(\mu, \sigma)$
could differ, even though, in this case, the interpretations of the parameters as strong
law limits of observable measures of the location and spread of the measurements
are the same.
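
As a purely illustrative aside, the three error densities above can be checked
numerically. The following Python sketch evaluates each density on a fine grid and
confirms that, with the parametrisation given, all three integrate to one and share
mean $\mu$ and variance $\sigma^2$, so that $\sigma$ has the same interpretation across the
range of models. The function names, the grid settings and the use of NumPy are our
own assumptions and are not part of the original development.

```python
import numpy as np

SQRT2, SQRT3 = np.sqrt(2.0), np.sqrt(3.0)

def normal_pdf(x, mu, sigma):
    # p_1: normal density with mean mu and standard deviation sigma
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2.0 * np.pi * sigma**2)

def double_exp_pdf(x, mu, sigma):
    # p_2: double-exponential density scaled so that its variance is sigma^2
    return np.exp(-SQRT2 * np.abs(x - mu) / sigma) / (SQRT2 * sigma)

def uniform_pdf(x, mu, sigma):
    # p_3: uniform density on (mu - sqrt(3) sigma, mu + sqrt(3) sigma)
    inside = np.abs(x - mu) < SQRT3 * sigma
    return np.where(inside, 1.0 / (2.0 * SQRT3 * sigma), 0.0)

def moments(pdf, mu, sigma, half_width=40.0, n=400_001):
    # Riemann-sum approximation (hypothetical helper) of mass, mean and variance
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    dx = x[1] - x[0]
    w = pdf(x, mu, sigma)
    mass = np.sum(w) * dx
    mean = np.sum(x * w) * dx
    var = np.sum((x - mean) ** 2 * w) * dx
    return mass, mean, var

if __name__ == "__main__":
    for pdf in (normal_pdf, double_exp_pdf, uniform_pdf):
        print(pdf.__name__, moments(pdf, mu=1.0, sigma=2.0))
```

The common variance is what allows a single prior $p_j(\mu, \sigma)$ to be shared across
the three forms when the modelling possibilities emanate from one individual.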


Normality versus non-Normality 


Suppose that $\mathcal{N} \subset \mathcal{F}$ is the set of all normal distributions on the real line, and hence
that $\mathcal{N}^c = \mathcal{F} - \mathcal{N}$ is the set of all distributions other than normal. Then, given
the assumption of exchangeability for a real-valued sequence, an individual
dogmatically asserting normality is specifying, in the general representation, a $Q_1(F)$
which concentrates with probability one on $\mathcal{N}$. Conversely, an individual
dogmatically asserting non-normality is specifying a $Q_2(F)$ which concentrates with
probability one on $\mathcal{N}^c$. Our purpose here is mainly to point out how choices within
the general exchangeable framework correspond to specification of $Q$. However,
given the “size” of $\mathcal{F}$, one cannot but be struck by the monumental dogmatism
implicit in $Q_1$!


Parametric Hypotheses 


Suppose that $Q_j$, $j = 1, \ldots, k$, are even more dogmatic, in that they not only all
focus on a single parametric family, $p(x \mid \theta)$, but, within the family, they specify
$\theta_1, \ldots, \theta_k$, respectively, as the values of the parameter, so that

\[
p_j(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \theta_j).
\]

If $k = 2$, this is often referred to as a situation of two simple hypotheses.

A somewhat different situation arises if $k = 2$, $Q_1$ again focuses on a specific
parameter value, $\theta_1$, but $Q_2$ simply assigns a prior density $p(\theta)$ over $\theta$ in $p(x \mid \theta)$.
The rival models then have the representations

\[
p_1(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \theta_1),
\]

\[
p_2(x_1, \ldots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \theta) \, p(\theta) \, d\theta,
\]

corresponding to what are usually referred to, within the context of the parametric
family $p(x \mid \theta)$, as a simple hypothesis and a general alternative.
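
To make the contrast between a simple hypothesis and a general alternative concrete,
the following Python sketch evaluates both predictive densities for a short sequence of
zero-one observations. The Bernoulli family, the Beta prior with parameters `a` and `b`,
and the function names are illustrative assumptions of ours; the text itself leaves the
parametric family $p(x \mid \theta)$ unspecified.

```python
from math import exp, lgamma, log

def log_beta(u, v):
    # Logarithm of the Beta function B(u, v)
    return lgamma(u) + lgamma(v) - lgamma(u + v)

def log_p_simple(xs, theta1):
    # Simple hypothesis: product of Bernoulli terms at the fixed value theta1
    s, n = sum(xs), len(xs)
    return s * log(theta1) + (n - s) * log(1.0 - theta1)

def log_p_general(xs, a=1.0, b=1.0):
    # General alternative: the Bernoulli likelihood integrated against an
    # assumed Beta(a, b) prior, which gives B(a + s, b + n - s) / B(a, b)
    s, n = sum(xs), len(xs)
    return log_beta(a + s, b + n - s) - log_beta(a, b)

if __name__ == "__main__":
    data = [1, 0, 1, 1]
    print("simple hypothesis  :", exp(log_p_simple(data, 0.5)))
    print("general alternative:", exp(log_p_general(data)))
```

Working on the logarithmic scale is a design choice made here only to avoid
underflow for long sequences.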

In the context of judgements of partial rather than full exchangeability, the
many versions of the former discussed in Chapter 4 clearly provide considerable
scope for positing interesting ranges of models in any given application. The
examples which follow illustrate just a few of these possibilities, expanding somewhat on
the earlier discussion of model elaboration and simplification given in Sections 4.7.3
and 4.7.4.


Several Samples 


Consider the situation of $m$ unrestrictedly exchangeable sequences of zero-one
random quantities, discussed in detail in Section 4.6.2. We recall that, if
$x(n_i) = (x_{i1}, \ldots, x_{in_i})$, $i = 1, \ldots, m$, the general representation of the joint
predictive density for $x(n_1), \ldots, x(n_m)$ is given by

\[
\int_{[0,1]^m} \prod_{i=1}^{m} \prod_{j=1}^{n_i} \theta_i^{x_{ij}} (1 - \theta_i)^{1 - x_{ij}} \, dQ(\theta_1, \ldots, \theta_m),
\]

so that, given a basic assumption of unrestricted exchangeability, alternative models
are defined by different forms of $Q$.

As a stylised illustration of the possibilities, we might consider:

$Q_1$: assigning probability one to $\theta_1 = \cdots = \theta_m = \theta$, say, corresponding to
the assumed equality of the limiting frequencies of ones in each of the $m$
sequences, so that $dQ(\theta_1, \ldots, \theta_m)$ reduces to $dQ_1(\theta)$;

$Q_2$: assigning probability one to $\theta_1 = \phi_1$, $\theta_2 = \cdots = \theta_m = \phi_2$, say, so that
$dQ(\theta_1, \ldots, \theta_m)$ reduces to $dQ_2(\phi_1, \phi_2)$;

$Q_3$: retaining a general, non-degenerate, form $dQ(\theta_1, \ldots, \theta_m)$ over the limiting
frequencies.


For example, in the context of 0-1 responses in $m$ clinical trial treatment
groups, $Q_1$ corresponds, loosely speaking, to the hypothesis that all treatments
have the same effect; $Q_2$ corresponds to the hypothesis that one of the treatments
(possibly a “control”) is different from all the other treatments, which themselves
have the same effects; $Q_3$ corresponds to a general hypothesis that all treatments
have different effects, any further (non-degenerate) relationships among them being
defined by the specific form of $Q_3$.
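
A minimal numerical sketch of these three specifications is given below in Python. It
assumes, purely for illustration, that every free limiting frequency receives an
independent Beta(`a`, `b`) prior; in particular, the independent form used for the third
specification is only one of many possible non-degenerate choices. The function names
and the summary of each group as a pair (number of ones, number of trials) are
likewise our own assumptions.

```python
from math import exp, lgamma

def log_beta(u, v):
    # Logarithm of the Beta function B(u, v)
    return lgamma(u) + lgamma(v) - lgamma(u + v)

def log_marginal(ones, trials, a=1.0, b=1.0):
    # Log predictive density of a 0-1 sequence sharing one limiting frequency,
    # integrated against an assumed Beta(a, b) prior
    return log_beta(a + ones, b + trials - ones) - log_beta(a, b)

def log_predictives(groups):
    # groups: list of (ones, trials) pairs, one pair per sequence
    ones = [g[0] for g in groups]
    trials = [g[1] for g in groups]
    return {
        # Q1: a single common frequency, so all sequences are pooled
        "Q1": log_marginal(sum(ones), sum(trials)),
        # Q2: the first sequence has its own frequency, the rest are pooled
        "Q2": log_marginal(ones[0], trials[0])
              + log_marginal(sum(ones[1:]), sum(trials[1:])),
        # Q3: every sequence has its own frequency (independent priors assumed)
        "Q3": sum(log_marginal(s, n) for s, n in groups),
    }

if __name__ == "__main__":
    for name, value in log_predictives([(2, 2), (0, 2), (0, 2)]).items():
        print(name, exp(value))
```

In the illustrative data, the first group differs markedly from the other two, which
behave alike, so the second specification attains the largest predictive density.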


Structured Layouts 


In Section 4.6.3, we considered triply subscripted random quantities, $x_{ijk}$,
representing the $k$th of a number of replicates of an observable in “context” $i \in I$,
subject to “treatment” $j \in J$. In particular, we considered the situation where the
predictive model might be thought of as generated via conditionally independent
normal

\[
N(x_{ijk} \mid \mu + \alpha_i + \beta_j + \gamma_{ij}, \tau)
\]

distributions, together with a prior distribution $Q$ for $\tau$ and any $IJ$ linearly
independent combinations of $\{\alpha_i\}$, $\{\beta_j\}$, $\{\gamma_{ij}\}$, $i \in I$, $j \in J$.

As a stylised illustration of alternative modelling possibilities, we might
consider:

$Q_1$: specifying $\gamma_{ij} = 0$ for all $i, j$, together with a non-degenerate specification for
$\{\alpha_i\}$, $\{\beta_j\}$ and $\mu$;

$Q_2$: specifying $\gamma_{ij} = 0$ for all $i, j$ and $\beta_j = 0$ for all $j$, together with a
non-degenerate specification for $\{\alpha_i\}$ and $\mu$;

$Q_3$: specifying $\gamma_{ij} = 0$ for all $i, j$ and $\alpha_i = 0$ for all $i$, together with a
non-degenerate specification for $\{\beta_j\}$ and $\mu$;

$Q_4$: specifying $\gamma_{ij} = 0$, $\alpha_i = 0$, $\beta_j = 0$, for all $i, j$, together with a non-degenerate
specification for $\mu$.


The reader familiar with analysis of variance methods will readily identify 
these prior specifications with conventional forms of hypotheses regarding absences 
of interaction and main effects. 
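
The effect of these four specifications on the cell means $\mu + \alpha_i + \beta_j + \gamma_{ij}$ can be
sketched as follows in Python. The dictionary of retained effects, the label `"full"`
for the unconstrained layout and the function name are hypothetical devices of ours,
introduced only to show which terms each specification sets to zero.

```python
import numpy as np

# Effects retained under each specification; labels follow the list above,
# and "full" (our addition) denotes the unconstrained layout.
RETAINED = {
    "full": {"alpha", "beta", "gamma"},
    "Q1": {"alpha", "beta"},   # no interaction
    "Q2": {"alpha"},           # no interaction, no treatment effects
    "Q3": {"beta"},            # no interaction, no context effects
    "Q4": set(),               # common mean only
}

def cell_means(mu, alpha, beta, gamma, model):
    # Returns the I x J array of means mu + alpha_i + beta_j + gamma_ij,
    # with the effects excluded by `model` replaced by zeros.
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    keep = RETAINED[model]
    a = alpha if "alpha" in keep else np.zeros_like(alpha)
    b = beta if "beta" in keep else np.zeros_like(beta)
    g = gamma if "gamma" in keep else np.zeros_like(gamma)
    return mu + a[:, None] + b[None, :] + g

if __name__ == "__main__":
    for label in ("full", "Q1", "Q2", "Q3", "Q4"):
        print(label)
        print(cell_means(10.0, [1.0, -1.0], [2.0, 0.0, -2.0], np.ones((2, 3)), label))
```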


Covariates 


In Section 4.6.4, we discussed a variety of models involving covariates, where
beliefs about the sequence of observables ($x$'s) were structurally dependent on
another set of observables ($z$'s). Given the enormous potential variety of such
covariate dependent models, it does not seem appropriate to attempt a notationally
precise illustration of all possibilities. Instead, we shall simply indicate in general
terms, for each of the cases considered in Section 4.6.4, the kinds of alternative
models that might be considered.


Example 6.1. (Bioassay). Alternative models for a single experiment might
correspond to different assumptions about the functional dependence of the survival probabilities
on the dose (for example, logit versus probit). In the case of several separate experiments,
alternative models might assume the same functional form, but differ in whether or not they
constrain model parameters (for example, the LD50's) to be equal.


Example 6.2. (Growth curves). Alternative models for an individual growth curve
might correspond to different assumptions about the functional dependence of the response
on time (for example, linear versus logistic). In the case of several growth curves for
subjects from a relatively homogeneous population, alternative models might be concerned
with whether some or all of the parameters defining the growth curves are identical or differ
across subjects.


Example 6.3. (Multiple regression). Alternative models in the multiple regression 
context typically correspond to whether or not various regressor variables can be omitted 
from the linear regression form; equivalently, to whether or not various regression coefficients 
can be set equal to zero. 
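
The size of the resulting range of models is easily illustrated. The short Python sketch
below enumerates every subset of a list of regressors, each subset standing for the
model in which the omitted regressors have coefficient zero. The regressor labels and
the function name are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def candidate_models(regressors):
    # Every subset of the regressors, from the empty model to the full one;
    # with p regressors this gives 2**p candidate models.
    models = []
    for size in range(len(regressors) + 1):
        models.extend(combinations(regressors, size))
    return models

if __name__ == "__main__":
    for model in candidate_models(["z1", "z2", "z3"]):
        print(model)
```

The exponential growth in the number of candidates is one practical reason why, in
applications, attention is usually restricted to a much smaller range.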


In the third volume of this work, Bayesian Methods, we shall discuss in detail 
a number of practical applications of this kind. 


Hierarchical Models 


Given the enormous variety of potential hierarchical models and alternative forms, 
we shall just content ourselves with some general comments for one of the specific 
cases considered in Section 4.6.5. 


Example 6.4. (Exchangeable normal mean parameters).

In Example 4.16 of Section 4.6.5, we considered a case where all the means, $\mu_1, \ldots, \mu_m$,
of the $m$ groups of observables with normal parametric models were judged exchangeable,
and where this latter relationship was modelled as a mixture over a further parametric form,
reflecting a symmetric judgement of “similarity” for $\mu_1, \ldots, \mu_m$. However, other symmetry
judgements are possible: for example, that $m - 1$ of the $\mu_i$'s are exchangeable, the other one
is not, but all are equally likely, a priori, to be the odd one out. This would create a model
allowing potential “outliers” among the $m$ groups themselves. (See Section 4.7.3 for further
development of this idea in a non-hierarchical setting.)


Confronted with a range of possible models, how should an individual or a 
group proceed? 

From the perspective adopted throughout this book, clearly the answer depends 
on the perceived decision problem to which the modelling is a response. In the re- 
mainder of this chapter, we shall therefore illustrate various of the kinds of decision 
problems that might be considered. The emphasis will be on somewhat stylised, 
typically simple, versions of such problems, in order to highlight the conceptual 
issues. Detailed case-studies, involving the substantive complexities of context 
and the computational complexities of implementation will be more appropriately 
presented in the volumes Bayesian Computation and Bayesian Methods. 


6.1.2 Perspectives on Model Comparison 


To be concrete, let us assume that all the belief models $P_i$, $i \in I$, say, under
consideration for observations $x$ can be described in terms of finite parameter
mixture representations. Given the specifications of the various densities forming
the mixtures, the predictive distributions for the alternative models are described
by

\[
p_i(x) = p(x \mid M_i) = \int p_i(x \mid \theta_i) \, p_i(\theta_i) \, d\theta_i, \quad i \in I.
\]

For mnemonic convenience, from now on we shall denote the alternative models
by $\{M_i, i \in I\}$ (rather than $P_i$, $i \in I$, as in our previous discussions) and the set of
these models by $\mathcal{M} = \{M_i, i \in I\}$.

Before we turn to a detailed discussion of decisions concerning model choice or
comparison among $\{M_i, i \in I\}$, we need to draw attention to important distinctions
among three alternative ways in which these possible models might be viewed.

The first alternative, which we shall call the M-closed view, corresponds
to believing that one of the models $\{M_i, i \in I\}$ is “true”, without the explicit
knowledge of which of them is the true model. From this perspective, which may
reflect either the range of uncertainties within an undecided individual, or the range
of different beliefs of a group of individuals, the overall model specifies beliefs for
$x$ of the form

\[
p(x) = \sum_{i \in I} P(M_i) \, p(x \mid M_i),
\]

with $P(M_i)$ denoting prior weights on the component models $\{M_i, i \in I\}$. There
is, of course, some ambiguity as to what should be regarded as a component model
(for example, the renormalised mixture of $M_1$ and $M_2$ could itself be regarded as
a model), but this can be resolved pragmatically by taking $\{M_i, i \in I\}$ to be those
individual models we are interested in comparing or choosing among.

But, continuing the discussion of Section 4.8.3 on the role and nature of
models, when does it actually make sense to speak of a “true” model and hence to
adopt the M-closed perspective?

Clearly, this would be appropriate whenever one knew for sure that the real 
world mechanism involved was one of a specified finite set. One rather artificial 
situation where this would apply would be that of a computer simulation “inference 
game”, where data are known to have been generated using one of a set of possi- 
ble simulation programs, each a coded version of a different specified probability 
model, but it is not known which program was used. 

Beyond such “controlled” situations, it seems to us to be difficult to accept
the M-closed perspective in a literal sense. However, there may be situations
where one might not feel too uncomfortable in proceeding “as if” one meant it.
For example, suppose that a parametric model with a specified parameter has been
extensively adopted and found to be a successful predictive device in a range of
applications. Now suppose that a new application context arises and that it is felt
necessary to reconsider whether to continue with the previous specified parameter
value or, in this new context, to incorporate uncertainty about the appropriate value.
Provided we feel comfortable, in principle, with assigning prior weights to these
two alternative formulations, we can exploit the M-closed framework.

However, reality is typically not as relatively straightforward as this. Nature
does not provide us with an exhaustive list of possible mechanisms and a guarantee
that one of them is true. Instead, we ourselves choose the lists as part of the process
of settling on a predictive specification that we hope will prove “fit for purpose”,
in the jargon of modern quality assurance.

But if we abandon the M-closed perspective, how else might we approach 
the very real and important problem of comparing or choosing among alternative 
models? It seems to us that the approach depends critically on whether one has 
oneself separately formulated a clear belief model or not. 

In the former case, the alternative models are presumably being contemplated 
as a proxy because the actual belief model is too cumbersome to implement; how- 
ever, they will still have to be evaluated and compared in the light of these actual 
beliefs. In the latter case, in the absence of an actual specified belief model, it would 
seem intuitively —and we shall see this more formally later—that the alternative 
models have to battle it out among themselves on some cross-validatory basis. 
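
One simple way in which such a cross-validatory comparison might be organised is
sketched below in Python. Each zero-one observation is predicted in turn from the
remaining ones, and the logarithms of the predictive probabilities assigned to the
values actually observed are summed. The logarithmic score, the two candidate models
(a fixed frequency and a Beta(`a`, `b`) mixture of Bernoulli models) and the function
names are illustrative assumptions of ours, and should not be read as the formal
treatment deferred to Section 6.1.6.

```python
from math import log

def loo_log_score_fixed(xs, theta):
    # Candidate 1: a fixed frequency theta; the prediction for each
    # observation does not depend on the others.
    return sum(log(theta if x == 1 else 1.0 - theta) for x in xs)

def loo_log_score_beta(xs, a=1.0, b=1.0):
    # Candidate 2: Bernoulli model with an assumed Beta(a, b) prior; each
    # observation is predicted from the remaining n - 1 observations.
    s, n = sum(xs), len(xs)
    total = 0.0
    for x in xs:
        p_one = (a + s - x) / (a + b + n - 1)
        total += log(p_one if x == 1 else 1.0 - p_one)
    return total

if __name__ == "__main__":
    data = [1, 1, 0, 1, 1, 1, 0, 1]
    print("fixed theta = 0.5 :", loo_log_score_fixed(data, 0.5))
    print("Beta(1, 1) mixture:", loo_log_score_beta(data))
```

A larger total score indicates better predictive performance on the observations
held out in turn.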

We now proceed to give these alternative perspectives a somewhat more formal 
description. 

The second alternative, which we shall call the M-completed view, corresponds
to an individual acting as if $\{M_i, i \in I\}$ simply constitute a range of specified
models currently available for comparison, to be evaluated in the light of the
individual’s separate actual belief model, which we shall denote by $M_t$. From this
perspective, assigning the probabilities $\{P(M_i), i \in I\}$ does not make sense and the
actual overall model specifies beliefs for $x$ of the form $p(x) = p_t(x) = p(x \mid M_t)$.
M-completed models, relative to a given proposed range of models $M_i$, $i \in I$,
might be adopted for a variety of reasons. Typically, $\{M_i, i \in I\}$ will have been
proposed largely because they are attractive from the point of view of tractability
of analysis or communication of results compared with the actual belief model $M_t$.

The third alternative, which we shall call the M-open view, also acknowledges
that $\{M_i, i \in I\}$ are simply a range of specified models available for comparison,
so that assigning probabilities $\{P(M_i), i \in I\}$ does not make sense. However,
in this case, there is no separate overall actual belief specification, $p(x)$, perhaps
because we lack the time or competence to provide it.

Examples of lists of “proxy” models that are widely used include familiar
ones based on parametric components, corresponding to: regression models with
different choices of regressors; generalised linear models with different choices of
covariates, link functions, etc.; contingency table structures with different patterns
of independence and dependence assumptions.

The M-open perspective requires comparison of such models in the absence of
a separate belief model. The M-completed perspective will typically have selected
the particular proxy models in the light of an actual belief model. For example,
if the actual belief model is based on non-linear functions of many covariates,
together with Student probability distribution specifications, the proxy models to
be evaluated might be various linear regression models with limited numbers of
covariates and normal probability distribution specifications.


6.1.3 Model Comparison as a Decision Problem


We shall now discuss various possible decision problems where the answer to an
inference problem involves model choice or comparison among the alternatives in
$\mathcal{M} = \{M_i, i \in I\}$. Some of these only make sense from an M-closed perspective;
others can be approached from either an M-closed, an M-completed or an M-open
perspective. Throughout the following development, observed data on which
decisions are to be based will be denoted by $x$, and the choice of model $M_i$, either
as an end in itself, or as the basis for a subsequent answer to an inference problem,
will be denoted by $m_i$, $i \in I$.

The first decision problem we shall consider involves only the choice of an $M_i$,
without any subsequent action, so that the utility function has the form $u(m_i, \omega)$,
where $\omega$ is some unknown of interest. This decision structure is shown
schematically in Figure 6.1.


[Decision tree: data $x$, model choice $m_i$, utility $u(m_i, \omega)$]


Figure 6.1 A decision problem involving model choice only


It is perhaps not obvious why such a problem would be of interest from an
M-open perspective. However, from an M-closed perspective, an example of
an obvious $\omega$ of interest might be the $M_i$ for which, imagining a large future
sample of observations, $y = (y_1, \ldots, y_s)$, $P(M_i \mid y) \to 1$ as $s \to \infty$. Recalling
Proposition 5.9 of Section 5.2.3, $\omega$ in this case labels the “true” model, and the utility
of choosing a particular model then depends on whether a correct choice has been
made.

Whatever the forms of $\omega$ and $u(m_i, \omega)$, in the general decision problem defined
by Figure 6.1, maximising expected utility implies that the optimal model choice
$m^*$ is given by

\[
\bar{u}(m^* \mid x) = \sup_{i \in I} \bar{u}(m_i \mid x),
\]

where

\[
\bar{u}(m_i \mid x) = \int u(m_i, \omega) \, p(\omega \mid x) \, d\omega, \quad i \in I,
\]

with $p(\omega \mid x)$ representing actual beliefs about $\omega$ having observed $x$.
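
When $\omega$ takes only finitely many values, the integral above reduces to a weighted
sum and the optimal choice can be computed directly. The following Python sketch
makes this explicit; the discreteness of $\omega$, the array layout and the function name
are assumptions of ours, made only for illustration.

```python
import numpy as np

def best_model(utility, belief):
    # utility[i, k] = u(m_i, omega_k); belief[k] = p(omega_k | x).
    # Returns the index of the model maximising expected utility, together
    # with the vector of expected utilities of all models.
    utility = np.asarray(utility, dtype=float)
    belief = np.asarray(belief, dtype=float)
    expected = utility @ belief
    return int(np.argmax(expected)), expected

if __name__ == "__main__":
    # Hypothetical utilities for two models and two states of the world
    u = [[1.0, 0.0],
         [0.2, 0.9]]
    print(best_model(u, [0.3, 0.7]))
```

Different utility arrays applied to the same beliefs can lead to different optimal
models, in line with the dependence of model choice on the decision structure.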

In the M-closed case,

\[
p(\omega \mid x) = \sum_{i \in I} p_i(\omega \mid x) \, P(M_i \mid x),
\]

where

\[
P(M_i \mid x) = \frac{P(M_i) \, p(x \mid M_i)}{\sum_{j \in I} P(M_j) \, p(x \mid M_j)}
\]

and $p_i(\omega \mid x) = p(\omega \mid M_i, x)$ is given by standard (posterior or predictive)
manipulations conditional on model $M_i$, $i \in I$. We note, in particular, the key role played
by the quantities $\{P(M_i \mid x), i \in I\}$, which, within the purview of the M-closed
framework, are the posterior probabilities, given $x$, of model $M_i$, $i \in I$, being true.

From the M-completed perspective, we can, at least in principle, obtain
$p(\omega \mid x)$ and evaluate $\bar{u}(m_i \mid x)$, $i \in I$, even though this may require extensive
(Monte Carlo) numerical calculations in specific applications. In this way, one can
compare the models in $\mathcal{M}$, even though none of them corresponds to one’s own
assumption regarding the true model.

From the M-open perspective, nothing can be said in general about the explicit
form of $p(\omega \mid x)$. It turns out, however, perhaps surprisingly, that, at least
approximately, the same analysis can be carried out in the M-open as in the M-completed
framework; in other words, one can compare the models in $\mathcal{M}$ on the basis of their
expected utilities without actually having specified an alternative “true” model. We
shall defer a detailed discussion of this until Section 6.1.6.
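
The posterior model probabilities of the M-closed case can be computed as in the
following Python sketch, which takes the logarithms of the predictive densities
$p(x \mid M_i)$ together with the prior weights $P(M_i)$. The function name and the
rescaling by the largest term (a standard device to avoid numerical underflow) are
our own choices rather than anything prescribed in the text.

```python
import numpy as np

def posterior_model_probs(log_marginals, prior_weights):
    # log_marginals[i] = log p(x | M_i); prior_weights[i] = P(M_i).
    # Returns P(M_i | x), proportional to P(M_i) p(x | M_i).
    log_marginals = np.asarray(log_marginals, dtype=float)
    prior_weights = np.asarray(prior_weights, dtype=float)
    log_joint = np.log(prior_weights) + log_marginals
    log_joint -= log_joint.max()      # rescale before exponentiating
    weights = np.exp(log_joint)
    return weights / weights.sum()

if __name__ == "__main__":
    # Hypothetical predictive densities for two models with equal prior weights
    print(posterior_model_probs(np.log([0.0625, 0.05]), [0.5, 0.5]))
```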


Let us now consider a rather different form of decision problem which first
requires the choice of model $M_i$ from $\mathcal{M}$, which we denote by $m_i$, and then,
assuming $M_i$ to be the model, requires an answer $a_j$, $j \in J_i$, relating to an unknown
“state of the world” $\omega$ of interest. For example, we may wish to predict a future
observation, or estimate a parameter common to all the models in $\mathcal{M}$. If $u(m_i, a_j, \omega)$
denotes the utility resulting from the successive choices $m_i$ (i.e., model $M_i$) and
$a_j$, $j \in J_i$ (answer to inference question, given $M_i$), when $\omega$ is the actual “state of
the world”, the resulting decision problem is shown schematically in Figure 6.2.


[Decision tree: utility $u(m_i, a_j, \omega)$]


Figure 6.2 A decision problem involving model choice and subsequent inference


Systematic application of the criterion of maximising expected utility
establishes that the optimal model choice is that $m^*$ for which

\[
\bar{u}(m^* \mid x) = \sup_{i \in I} \bar{u}(m_i \mid x),
\]

where

\[
\bar{u}(m_i \mid x) = \int u(m_i, a_i^*, \omega) \, p(\omega \mid x) \, d\omega
\]

is the expected utility, given $x$, of optimal behaviour given model $M_i$, so that $a_i^*$ is
obtained from maximising

\[
\int u(m_i, a_j, \omega) \, p_i(\omega \mid x) \, d\omega.
\]

The form $p_i(\omega \mid x)$ in the above is again given by standard (posterior or
predictive) manipulation conditional on model $M_i$, $i \in I$, while the form $p(\omega \mid x)$ again
represents actual beliefs about $\omega$ given $x$.

The explicit form of $p(\omega \mid x)$ as a mixture of the $p_i(\omega \mid x)$ has been given
above in the M-closed case.

In the M-completed case, we have also noted above that evaluation of $p(\omega \mid x)$
and $\{\bar{u}(m_i \mid x), i \in I\}$ can in principle be carried out, numerically if necessary.
Detailed analysis for the M-open case will be given in Section 6.1.6.

From a conceptual perspective, it is important to recognise that different
choices of $\omega$ and different forms of utility structure will naturally imply different
forms of solution to the problem of model choice. In the next two subsections,
we shall explore a number of specific cases, in order to underline the general
message that coherent comparison of a finite or countable set of alternative models
depends on the specification (at least implicitly) of a decision structure, including
a utility function.

Before proceeding to further aspects of model choice and comparison, however,
it is worth remarking that, in the above context, it is not necessary to choose
among the elements of $\mathcal{M}$ in order to provide an answer $a_j$ to an inference problem.
If we omit the explicit model choice step, the resulting, different form of decision
problem is that shown schematically in Figure 6.3.


Figure 6.3 A decision problem involving terminal decision only

In this case, maximising expected utility leads immediately to the optimal
answer $a^*$, given by

$$\bar{u}(a^* \mid x) = \sup_{j} \bar{u}(a_j \mid x),$$

where

$$\bar{u}(a_j \mid x) = \int u(a_j, \omega)\, p(\omega \mid x)\, d\omega,$$

with $p(\omega \mid x)$ as discussed above. In the particular case of an $\mathcal{M}$-closed perspective,
it follows from the posterior weighted mixture form of $p(\omega \mid x)$ that, although we
have omitted the model choice step, model comparison in the light of the data $x$
is still being effected through the presence of $\{P(M_i \mid x), i \in I\}$. In general, if
we entertain a range of possible models for data $x$, solutions to decision problems
conditional on $x$ will always implicitly depend on a comparison of the models in
the light of the data, even if explicit choice among the models is not part of the
decision problem.


6.1.4 Zero-one Utilities and Bayes Factors 


In this section, we confine attention to the $\mathcal{M}$-closed perspective and consider first
the problem of choosing a model from $\mathcal{M}$, without any subsequent decision, when
the “state of the world” of interest is defined to be the “true” model, $M_i$, so that,
assuming a future sample $y = (y_1, \ldots, y_s)$, $P(M_i \mid y) \to 1$ as $s \to \infty$. From the
$\mathcal{M}$-closed perspective, the problem, stated colloquially, is that of choosing the true
model.

In this case, a natural form of utility function may be

$$u(m_i, \omega) = \begin{cases} 1 & \text{if } \omega = M_i \\ 0 & \text{if } \omega \neq M_i. \end{cases}$$

It is then easily seen from the analysis relating to Figure 6.1 that

$$p_i(\omega \mid x) = \begin{cases} 1 & \text{if } \omega = M_i \\ 0 & \text{if } \omega \neq M_i \end{cases}$$

and

$$p(\omega \mid x) = P(M_i \mid x) \quad \text{if } \omega = M_i, \quad i \in I.$$

The expected utility of the decision $m_i$ (choosing $M_i$), given $x$, is hence

$$\bar{u}(m_i \mid x) = \int u(m_i, \omega)\, p(\omega \mid x)\, d\omega = P(M_i \mid x), \quad i \in I.$$

The optimal decision is therefore to “choose the model which has the highest
posterior probability”.

Bayes Factors

Less formally, suppose that some form of intuitive measure of pairwise comparison
of plausibility is required between any two of the models $\{M_i, i \in I\}$. The above
analysis suggests that $M_i$, $M_j$ may be usefully compared using the posterior odds
ratio,

$$\frac{P(M_i \mid x)}{P(M_j \mid x)} = \frac{p(x \mid M_i)}{p(x \mid M_j)} \times \frac{P(M_i)}{P(M_j)},$$

where, for example,

$$p(x \mid M_i) = \int p_i(x \mid \theta_i)\, p_i(\theta_i)\, d\theta_i.$$

In words, the above comparison can be described as

“posterior odds ratio = integrated likelihood ratio × prior odds ratio”,

making explicit the key role of the ratio of integrated likelihoods in providing the
mechanism by which the data transform relative prior beliefs into relative posterior
beliefs, in the context of parametric models.

The fundamental importance of this transformation warrants the following
definition, apparently due to Turing (see, for example, Good, 1988b).


Definition 6.1. (Bayes factor). Given two hypotheses $H_i$, $H_j$ corresponding
to assumptions of alternative models, $M_i$, $M_j$, for data $x$, the Bayes factor in
favour of $H_i$ (and against $H_j$) is given by the posterior to prior odds ratio,

$$B_{ij}(x) = \frac{p(x \mid M_i)}{p(x \mid M_j)} = \left\{ \frac{P(M_i \mid x)}{P(M_j \mid x)} \right\} \bigg/ \left\{ \frac{P(M_i)}{P(M_j)} \right\}.$$

Intuitively, the Bayes factor provides a measure of whether the data $x$ have
increased or decreased the odds on $H_i$ relative to $H_j$. Thus, $B_{ij}(x) > 1$ signifies
that $H_i$ is now more relatively plausible in the light of $x$; $B_{ij}(x) < 1$ signifies that
the relative plausibility of $H_j$ has increased.

Good (1950) has suggested that the logarithms of the various ratios in the above
be called weights of evidence (a term apparently first used in a related context by
Peirce, 1878), so that $\log B_{ij}(x)$ corresponds to the integrated likelihood weight
of evidence in favour of $M_i$ (and against $M_j$). On this logarithmic scale, the prior
weight of evidence and $\log B_{ij}(x)$ combine additively to give the posterior weight
of evidence.

In Section 6.1.1, we noted the extremely simple forms of predictive models
which result when beliefs not only concentrate on a specific parametric family of
distributions, but also identify the value of the parameter. An alternative set of such
models, $M_i$, $i \in I$, then just corresponds to the specifications $\{p_i(x \mid \theta_i), i \in I\}$,
and the integrated likelihood ratios reduce to simple ratios of likelihoods.
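
As a small numerical illustration of the relation "posterior odds = Bayes factor × prior odds" and of the additivity of weights of evidence, the following Python sketch may help. The function names and the numbers used (a Bayes factor of 4 and prior odds of 1/2) are purely illustrative assumptions and do not come from the text.

```python
import math


def posterior_odds(bayes_factor: float, prior_odds: float) -> float:
    """Posterior odds of M_i against M_j: Bayes factor times prior odds."""
    return bayes_factor * prior_odds


def posterior_prob(bayes_factor: float, prior_odds: float) -> float:
    """P(M_i | x) when M_i and M_j are the only two models entertained."""
    odds = posterior_odds(bayes_factor, prior_odds)
    return odds / (1.0 + odds)


# Illustrative (made-up) inputs: B_ij(x) = 4, prior odds P(M_i)/P(M_j) = 1/2.
b_ij, prior = 4.0, 0.5
post = posterior_odds(b_ij, prior)

# On the logarithmic scale the weights of evidence combine additively.
log_post = math.log(b_ij) + math.log(prior)

print(post, posterior_prob(b_ij, prior), log_post)
```

Here the data quadruple the odds on $M_i$, so prior odds of 1 to 2 against become posterior odds of 2 to 1 in favour.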

Hypothesis Testing


The problem of hypothesis testing has its own conventional terminology, which,
within the framework we are adopting, can be described as follows. Two alternative
models, $M_1$, $M_2$, are under consideration and both are special cases of the predictive
model

$$p(x) = \int p(x \mid \theta)\, dQ(\theta),$$

with the same assumed parametric form $p(x \mid \theta)$, $\theta \in \Theta$, but with different choices
of $Q$. If, for model $M_i$, $Q_i$ assigns probability one to a specific value, $\theta_i$, say,
the model is said to reduce to a simple hypothesis for $\theta$ (recalling that the form
$p(x \mid \theta)$ is assumed throughout). If, for model $M_j$, $Q_j$ defines a non-degenerate
density $p_j(\theta)$ over $\Theta_j \subseteq \Theta$, the model is said to reduce to a composite hypothesis
for $\theta$. If a simple hypothesis is being compared with a composite hypothesis, so
that $\Theta_j = \Theta - \{\theta_i\}$, the latter is called a general alternative hypothesis.

In the situation where the “state of the world” of interest, $\omega$, is defined to be
the true model $M_i$, we can generalise slightly the zero-one utility structure used
earlier by assuming that

$$u(m_i, \omega) = -l_{ij}, \quad \omega = M_j, \quad i = 1, 2, \quad j = 1, 2,$$

with $l_{11} = l_{22} = 0$ and $l_{12}, l_{21} > 0$. Intuitively, there is a (possibly asymmetric) loss
in choosing the wrong model, and there is no loss in choosing the correct model.

Given data $x$, and using again $p_i(\omega \mid x) = 1$ if $\omega = M_i$ and 0 otherwise, the
expected utility of $m_i$ is then easily seen to be

$$\bar{u}(m_i \mid x) = -\left[\, l_{i1} P(M_1 \mid x) + l_{i2} P(M_2 \mid x) \,\right],$$

so that

$$\bar{u}(m_1 \mid x) < \bar{u}(m_2 \mid x) \quad \text{iff} \quad l_{12} P(M_2 \mid x) > l_{21} P(M_1 \mid x).$$

We thus prefer $M_2$ to $M_1$ if and only if

$$\frac{P(M_1 \mid x)}{P(M_2 \mid x)} < \frac{l_{12}}{l_{21}},$$

revealing a balancing of the posterior odds against the relative seriousness of the
two possible ways of selecting the wrong model. In the symmetric case, $l_{12} = l_{21}$,
the choice reduces to choosing the a posteriori most likely model, as shown earlier
for the zero-one case.

The following describes the forms of so-called Bayes tests which arise in 
comparing models when the latter are defined by parametric hypotheses. 

Proposition 6.1. (Forms of Bayes tests). In comparing two models, $M_1$, $M_2$,
defined by parametric hypotheses for $p(x \mid \theta)$, with utility structure

$$u(m_i, \omega) = -l_{ij}, \quad \omega = M_j, \quad i = 1, 2, \quad j = 1, 2,$$

with $l_{11} = l_{22} = 0$ and $l_{12}, l_{21} > 0$, $M_2$ is preferred to $M_1$ if and only if

$$B_{12}(x) < \frac{l_{12}}{l_{21}} \, \frac{P(M_2)}{P(M_1)},$$

where:

$$B_{12}(x) = \frac{p(x \mid \theta_1)}{p(x \mid \theta_2)} \quad \text{(simple versus simple test)},$$

$$B_{12}(x) = \frac{p(x \mid \theta_1)}{\int_{\Theta_2} p(x \mid \theta)\, p_2(\theta)\, d\theta} \quad \text{(simple versus composite test)},$$

$$B_{12}(x) = \frac{\int_{\Theta_1} p(x \mid \theta)\, p_1(\theta)\, d\theta}{\int_{\Theta_2} p(x \mid \theta)\, p_2(\theta)\, d\theta} \quad \text{(composite versus composite test)}.$$

Proof. The results follow directly from the preceding discussion.
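
The decision rule of Proposition 6.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the numerical inputs are hypothetical, and the Bayes factor is taken as already computed by whichever of the three forms applies.

```python
def expected_utilities(post_m1: float, l12: float, l21: float):
    """Expected utilities of m_1 and m_2 given P(M_1 | x), with l11 = l22 = 0.

    l12 is the loss from choosing M_1 when M_2 is true;
    l21 is the loss from choosing M_2 when M_1 is true.
    """
    post_m2 = 1.0 - post_m1
    return -l12 * post_m2, -l21 * post_m1


def prefer_m2(b12: float, prior_m1: float, l12: float, l21: float) -> bool:
    """Bayes test: M_2 preferred iff B_12(x) < (l12 / l21) * P(M_2) / P(M_1)."""
    prior_m2 = 1.0 - prior_m1
    return b12 < (l12 / l21) * (prior_m2 / prior_m1)


# Illustrative inputs (assumed): B_12 = 3 with equal prior probabilities.
b12, prior_m1 = 3.0, 0.5
post_m1 = b12 * prior_m1 / (b12 * prior_m1 + (1.0 - prior_m1))

symmetric = prefer_m2(b12, prior_m1, l12=1.0, l21=1.0)    # zero-one type losses
asymmetric = prefer_m2(b12, prior_m1, l12=10.0, l21=1.0)  # wrongly choosing M_1 is costly
print(symmetric, asymmetric)
```

With symmetric losses the more probable model $M_1$ is retained; making a wrong choice of $M_1$ ten times as serious reverses the preference even though the posterior odds favour $M_1$.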


The following examples illustrate both general model comparison and a spe- 
cific instance of hypothesis testing. 


Example 6.5. (Geometric versus Poisson). Suppose we wish to compare the two
completely specified parametric models, Negative-Binomial and Poisson, defined for
conditionally independent $x_1, \ldots, x_n$ by

$$M_1: \; \mathrm{Nb}(x_i \mid \theta_1, 1), \qquad M_2: \; \mathrm{Pn}(x_i \mid \theta_2), \qquad i = 1, \ldots, n.$$

The Bayes factor in this case is given by the simple likelihood ratio

$$B_{12}(x \mid \theta_1, \theta_2) = \frac{\prod_{i=1}^{n} \mathrm{Nb}(x_i \mid \theta_1, 1)}{\prod_{i=1}^{n} \mathrm{Pn}(x_i \mid \theta_2)} = \frac{\theta_1^{n} (1 - \theta_1)^{n\bar{x}}}{\theta_2^{n\bar{x}} e^{-n\theta_2} \left( \prod_{i=1}^{n} x_i! \right)^{-1}}.$$

Suppose for illustration that $\theta_1 = \frac{1}{3}$, $\theta_2 = 2$ (implying equal mean values $E[x] = 2$ for both
models); then, for example, with $n = 2$, $x_1 = x_2 = 0$, we have $B_{12}(x) = e^{4}/9 = 6.07$,
indicating an increase in plausibility for $M_1$, whereas with $n = 2$, $x_1 = x_2 = 2$, we have
$B_{12}(x) = 4e^{4}/729 = 0.30$, indicating a slight increase in plausibility for $M_2$.

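Suppose now that $\theta_1$, $\theta_2$ are not known and are assigned the prior distributions

$$p_1(\theta_1) = \mathrm{Be}(\theta_1 \mid \alpha_1, \beta_1), \qquad p_2(\theta_2) = \mathrm{Ga}(\theta_2 \mid \alpha_2, \beta_2),$$

whose forms are given in Section 3.2.2 (where details of $\mathrm{Nb}(x_i \mid \theta_1, 1)$ and $\mathrm{Pn}(x_i \mid \theta_2)$ can
also be found). It follows straightforwardly that

$$p(x_1, \ldots, x_n \mid M_1) = \frac{\Gamma(\alpha_1 + \beta_1)}{\Gamma(\alpha_1)\Gamma(\beta_1)} \int_0^1 \theta_1^{\,n + \alpha_1 - 1} (1 - \theta_1)^{n\bar{x} + \beta_1 - 1}\, d\theta_1
= \frac{\Gamma(\alpha_1 + \beta_1)}{\Gamma(\alpha_1)\Gamma(\beta_1)} \, \frac{\Gamma(n + \alpha_1)\Gamma(n\bar{x} + \beta_1)}{\Gamma(n + n\bar{x} + \alpha_1 + \beta_1)}$$

and that

$$p(x_1, \ldots, x_n \mid M_2) = \frac{\beta_2^{\alpha_2}}{\Gamma(\alpha_2) \prod_{i=1}^{n} x_i!} \int_0^\infty \theta_2^{\,n\bar{x} + \alpha_2 - 1} e^{-(n + \beta_2)\theta_2}\, d\theta_2
= \frac{\Gamma(n\bar{x} + \alpha_2)\, \beta_2^{\alpha_2}}{\Gamma(\alpha_2)\,(n + \beta_2)^{n\bar{x} + \alpha_2}} \, \frac{1}{\prod_{i=1}^{n} x_i!},$$

so that

$$B_{12}(x) = \frac{\Gamma(\alpha_1 + \beta_1)\,\Gamma(n + \alpha_1)\,\Gamma(n\bar{x} + \beta_1)}{\Gamma(\alpha_1)\,\Gamma(\beta_1)\,\Gamma(n + n\bar{x} + \alpha_1 + \beta_1)} \cdot \frac{\Gamma(\alpha_2)\,(n + \beta_2)^{n\bar{x} + \alpha_2} \prod_{i=1}^{n} x_i!}{\Gamma(n\bar{x} + \alpha_2)\, \beta_2^{\alpha_2}}.$$

We further note that

$$E(x_i \mid M_1) = \int_0^1 E(x_i \mid M_1, \theta_1)\, \mathrm{Be}(\theta_1 \mid \alpha_1, \beta_1)\, d\theta_1
= \int_0^1 \frac{1 - \theta_1}{\theta_1}\, \mathrm{Be}(\theta_1 \mid \alpha_1, \beta_1)\, d\theta_1
= \frac{\Gamma(\alpha_1 + \beta_1)}{\Gamma(\alpha_1)\Gamma(\beta_1)} \, \frac{\Gamma(\alpha_1 - 1)\Gamma(\beta_1 + 1)}{\Gamma(\alpha_1 + \beta_1)} = \frac{\beta_1}{\alpha_1 - 1}$$

and

$$E(x_i \mid M_2) = \int_0^\infty E(x_i \mid M_2, \theta_2)\, \mathrm{Ga}(\theta_2 \mid \alpha_2, \beta_2)\, d\theta_2
= \int_0^\infty \theta_2\, \mathrm{Ga}(\theta_2 \mid \alpha_2, \beta_2)\, d\theta_2 = \frac{\alpha_2}{\beta_2},$$

so that prior specifications with $(\alpha_1 - 1)\alpha_2 = \beta_1 \beta_2$ imply the same means for the two
predictive models.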


Table 6.1 Dependence of $B_{12}(x)$ on prior-data combinations

                      α1 = 2,  β1 = 2     α1 = 31, β1 = 60    α1 = 2, β1 = 3
                      α2 = 2,  β2 = 1     α2 = 60, β2 = 30    α2 = 3, β2 = 2

x1 = x2 = 0           2.70                5.69                0.80
x1 = x2 = 2           0.29                0.30                0.49

As an illustration of the way in which the prior specification can affect the inferences,
we present in Table 6.1 a selection of values of $B_{12}(x)$ resulting from particular prior-data
combinations.

In the first two columns, the priors specify the same predictive means for the two models,
namely $E[x \mid M_i] = 2$, but the priors in the second column are much more informative. In the
final column, different predictive means are specified. Column 2 gives Bayes factors close to
those obtained above assuming $\theta_1$, $\theta_2$ known, as might be expected from prior distributions
concentrating sharply around the values $\theta_1 = \frac{1}{3}$ and $\theta_2 = 2$. However, comparison of the
first and third columns for $x_1 = x_2 = 0$ makes clear that, with small data sets, seemingly
minor changes in the priors for model parameters can lead to changes in direction in the
Bayes factor.
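
The arithmetic of Example 6.5 can be checked with a short Python sketch. It evaluates the Bayes factor on the log scale, first with the parameters known and then with the Beta and Gamma priors integrated out, using the closed forms given above. The function names are illustrative assumptions; only the standard library is used, and only the known-parameter values and the first two columns of Table 6.1 are checked here.

```python
from math import exp, factorial, lgamma, log


def log_b12_known(x, theta1, theta2):
    """log B_12 for Nb(x_i | theta1, 1) against Pn(x_i | theta2), parameters known."""
    n, s = len(x), sum(x)
    log_nb = n * log(theta1) + s * log(1.0 - theta1)
    log_pn = s * log(theta2) - n * theta2 - sum(log(factorial(xi)) for xi in x)
    return log_nb - log_pn


def log_b12_integrated(x, a1, b1, a2, b2):
    """log B_12 with theta1 ~ Be(a1, b1) and theta2 ~ Ga(a2, b2) integrated out."""
    n, s = len(x), sum(x)
    log_p1 = (lgamma(a1 + b1) - lgamma(a1) - lgamma(b1)
              + lgamma(n + a1) + lgamma(s + b1) - lgamma(n + s + a1 + b1))
    log_p2 = (a2 * log(b2) - lgamma(a2) + lgamma(s + a2)
              - (s + a2) * log(n + b2) - sum(log(factorial(xi)) for xi in x))
    return log_p1 - log_p2


known_00 = exp(log_b12_known([0, 0], 1 / 3, 2))   # e^4 / 9
known_22 = exp(log_b12_known([2, 2], 1 / 3, 2))   # 4 e^4 / 729

col1 = [exp(log_b12_integrated(x, 2, 2, 2, 1)) for x in ([0, 0], [2, 2])]
col2 = [exp(log_b12_integrated(x, 31, 60, 60, 30)) for x in ([0, 0], [2, 2])]
print(round(known_00, 2), round(known_22, 2), col1, col2)
```

Working with `lgamma` rather than the Gamma function itself avoids overflow for the more informative priors of the second column.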


The point made at the end of the above example is, of course, a general one. 
In any model comparison, the Bayes factor will depend on the prior distributions 
specified for the parameters of each model. That such dependence can be rather 
striking is well illustrated in the following example. 


Example 6.6. (Lindley’s paradox). Suppose that for $x = (x_1, \ldots, x_n)$ two alternative
models $M_1$, $M_2$, with $P(M_i) > 0$, $i = 1, 2$, correspond to simple and composite hypotheses
about $\mu$ in $N(x_i \mid \mu, \lambda)$ defined by

$$M_1: \; p_1(x) = \prod_{i=1}^{n} N(x_i \mid \mu_0, \lambda), \qquad \mu_0, \lambda \text{ known},$$

$$M_2: \; p_2(x) = \int \prod_{i=1}^{n} N(x_i \mid \mu, \lambda)\, N(\mu \mid \mu_1, \lambda_1)\, d\mu, \qquad \mu_1, \lambda_1, \lambda \text{ known}.$$

In more conventional terminology, $x_1, \ldots, x_n$ are a random sample from $N(x \mid \mu, \lambda)$, with
precision $\lambda$ known; the null hypothesis is that $\mu = \mu_0$, and the alternative hypothesis is that
$\mu \neq \mu_0$, with uncertainty about $\mu$ described by $N(\mu \mid \mu_1, \lambda_1)$.
Since $\bar{x} = n^{-1} \sum_{i=1}^{n} x_i$ is a sufficient statistic under both models, we easily see that

$$B_{12}(x) = \frac{N(\bar{x} \mid \mu_0, n\lambda)}{\int N(\bar{x} \mid \mu, n\lambda)\, N(\mu \mid \mu_1, \lambda_1)\, d\mu}
= \left( \frac{\lambda_1 + n\lambda}{\lambda_1} \right)^{1/2} \frac{\exp\left\{ \frac{1}{2} \left( \lambda_1^{-1} + (n\lambda)^{-1} \right)^{-1} (\bar{x} - \mu_1)^2 \right\}}{\exp\left\{ \frac{1}{2} n\lambda (\bar{x} - \mu_0)^2 \right\}}.$$

It is easily checked that, for any fixed $\bar{x}$, $B_{12}(x) \to \infty$ as $\lambda_1 \to 0$, so that evidence in
favour of $M_1$ becomes overwhelming as the prior precision in $M_2$ gets vanishingly small,
and hence $P(M_1 \mid x) \to 1$. In particular, this is true for $\bar{x}$ such that $\lambda^{1/2} |\bar{x} - \mu_0|$ is
large enough to cause the “null hypothesis” to be “rejected” at any arbitrary, prespecified
level using a conventional significance test! This “paradox” was first discussed in detail by
Lindley (1957) and has since occasioned considerable debate: see Smith (1965), Bernardo
(1980), Shafer (1982b), Berger and Delampady (1987), Moreno and Cano (1989), Berger
and Mortera (1991a) and Robert (1993) for further contributions and references.
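
The behaviour described in Example 6.6 is easy to see numerically. The Python sketch below evaluates the closed form for $B_{12}(x)$ for a fixed sample mean and a decreasing sequence of prior precisions $\lambda_1$. The particular numbers ($n = 100$, $\lambda = 1$, $\mu_0 = \mu_1 = 0$, $\bar{x} = 0.25$, so that $(n\lambda)^{1/2}|\bar{x} - \mu_0| = 2.5$) are assumptions chosen only for illustration.

```python
from math import exp, log


def b12_lindley(xbar, n, lam, mu0, mu1, lam1):
    """Bayes factor for the point null mu = mu0 against mu ~ N(mu1, precision lam1)."""
    # Precision of the marginal (predictive) distribution of xbar under M_2.
    tau = 1.0 / (1.0 / lam1 + 1.0 / (n * lam))
    log_b = (0.5 * log((lam1 + n * lam) / lam1)
             + 0.5 * tau * (xbar - mu1) ** 2
             - 0.5 * n * lam * (xbar - mu0) ** 2)
    return exp(log_b)


# Illustrative data: a sample mean 2.5 standard errors away from the null value.
n, lam, mu0, mu1, xbar = 100, 1.0, 0.0, 0.0, 0.25
precisions = [1.0, 1e-2, 1e-4, 1e-6, 1e-8]
factors = [b12_lindley(xbar, n, lam, mu0, mu1, lam1) for lam1 in precisions]

for lam1, b in zip(precisions, factors):
    print(f"lambda_1 = {lam1:g}: B_12 = {b:.3g}")
```

A conventional two-sided test would reject the null at the 5% level for these data, yet the Bayes factor moves steadily in favour of $M_1$ as the prior under $M_2$ is made more diffuse.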

A model comparison procedure which seems to be widely used implicitly in
statistical practice, but rarely formalised, is the following. Given the assumption of
a particular predictive model, $\{p(x \mid \theta), p(\theta)\}$, $\theta \in \Theta$, a posterior density, $p(\theta \mid x)$,
is derived and, as we have seen in Section 5.1.5, may be at least partially summarised
by identifying, for some $0 < p < 1$, a highest posterior density credible region
$R_p(x)$, which is typically the smallest region such that

$$\int_{R_p(x)} p(\theta \mid x)\, d\theta = p.$$

Intuitively, for large $p$, $R_p(x)$ contains those values of $\theta$ which are most plausible
given the model and the data. Conversely, the complement $R_p^c(x)$ consists of those values of $\theta$
which are rather implausible.

Now suppose that, given a specified $p$ and derived $R_p(x)$, one is going to assert
that the “true value” of $\theta$ (i.e., the value onto which $p(\theta \mid y)$ would concentrate as
the size of a future sample tended to $\infty$) lies in $R_p(x)$. Defining the decision
problem to be the choice of $p$, so that the possible answers to the inference problem
are in $A = [0, 1]$, with the state of the world $\omega$ defined to be the true $\theta$, a value
$a_p = p$ has to be chosen. An appropriate utility function may be

$$u(a_p, \theta) = \begin{cases} f(p) & \text{for } \theta \in R_p(x) \\ g(1 - p) & \text{for } \theta \in R_p^c(x), \end{cases}$$

where $f$ and $g$ are decreasing functions defined on $[0, 1]$. Essentially, such a utility
function extends the idea of a zero-one function by reflecting the desire for a
“correct” decision, but modified to allow for the fact that choosing $p$ close to one
leads to a rather vacuous assertion, whereas a correct assertion with $p$ small is rather
impressive.

The expected utility of choosing $a_p = p$ is easily seen to be

$$\bar{u}(a_p) = p f(p) + (1 - p) g(1 - p),$$

from which the optimal $p$ may be derived for any specific choices of $f$ and $g$.
We note that if $f = g$, the unique maximum is at $p = 0.50$, so that it becomes
optimal to quote a 50% highest posterior density credible region. If, for example,
$f(p) = 1 - p$, $g(1 - p) = [1 - (1 - p)]^2 = p^2$, the resulting optimal value of $p$
is $1/\sqrt{3} = 0.58$, so that a 58% credible region is appropriate. More exotically, if
$f(p) = 1 - (2.7)p^2$, $g(1 - p) = (1 - p)^{-1}$, the reader might like to verify that a
95% credible region is optimal.
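
The optimal level for a given pair $f$, $g$ can also be located numerically. The Python sketch below maximises $p f(p) + (1 - p) g(1 - p)$ over a fine grid for the choice $f(p) = 1 - p$, $g(1 - p) = p^2$ discussed above. The grid search and the function names are illustrative assumptions rather than anything prescribed by the text.

```python
def expected_utility(p, f, g):
    """Expected utility of asserting that theta lies in the HPD region of content p."""
    return p * f(p) + (1.0 - p) * g(1.0 - p)


def optimal_level(f, g, grid_size=100_001):
    """Grid search for the p in [0, 1] maximising the expected utility."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda p: expected_utility(p, f, g))


def f(p):
    return 1.0 - p


def g(q):
    # Written as a function of q = 1 - p, so that g(1 - p) = p**2.
    return (1.0 - q) ** 2


p_star = optimal_level(f, g)
print(round(p_star, 4))
```

For this choice the expected utility is $p - p^3$, whose derivative $1 - 3p^2$ vanishes at $p = 1/\sqrt{3}$, in agreement with the grid search.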


6.1.5 General Utilities 


Continuing for the present with the ($\mathcal{M}$-closed) hypothesis testing framework, the
consequences of incorrectly choosing a model may be less serious if the alternative
models are “close” in some sense, in which case utilities of the zero-one type, which
take no account of such “closeness”, may be inappropriate.

One-sided Tests


We shall illustrate this idea, and forms of possibly more reasonable utility functions,
by considering the special case of $\theta \in \Theta \subseteq \mathbb{R}$, with parametric form $p(x \mid \theta)$ and
models $M_1$, $M_2$ defined by

$$p_1(x \mid \theta) = p(x \mid \theta), \quad \theta \in \Theta_1 = \{\theta : \theta < \theta_0\},$$
$$p_2(x \mid \theta) = p(x \mid \theta), \quad \theta \in \Theta_2 = \{\theta : \theta > \theta_0\},$$

for some $\theta_0 \in \Theta$. The models thus correspond to the hypotheses that the parameter
is smaller or larger than some specified value $\theta_0$.

It seems reasonable in such a situation to suppose that if one were to incorrectly
choose $M_2$ ($\theta > \theta_0$) rather than $M_1$ ($\theta < \theta_0$), in many cases this would be much
less serious if the true value of $\theta$ were actually $\theta_0 - \varepsilon$ than if it were $\theta_0 - 100\varepsilon$,
say, for $\varepsilon > 0$. Such arguments suggest that, with the state of the world $\omega$ now
representing the true parameter value $\theta$, we might specify a utility function of the
form

$$u(m_i, \omega) = u(m_i, \theta) = \begin{cases} 0 & \text{for } \theta \in \Theta_i \\ -l_i(\theta) & \text{for } \theta \in \Theta_i^c, \end{cases}$$

for $i = 1, 2$, where $l_1$, $l_2$ are increasing positive functions of $(\theta - \theta_0)$ and $(\theta_0 - \theta)$,
respectively. The expected utility of the decision corresponding to $m_i$ (i.e., the
choice of $M_i$) is therefore given by

$$\bar{u}(m_i \mid x) = -\int_{\Theta_i^c} l_i(\theta)\, p(\theta \mid x)\, d\theta,$$

where

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta}.$$

The optimal answer to the inference problem is to prefer $M_1$ to $M_2$ if and only if

$$\bar{u}(m_1 \mid x) > \bar{u}(m_2 \mid x),$$

with explicit solutions depending, of course, on the choices of $l_1$, $l_2$, and the form
of $p(\theta \mid x)$, as illustrated in the following example.


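Example 6.7. (Normal posterior; linear losses).
If $l_1(\theta) = \theta - \theta_0$, $l_2(\theta) = k(\theta_0 - \theta)$, with $k$ reflecting the relative seriousness of
“overestimating” by choosing model $M_2$, and $p(\theta \mid x)$, given $x = x_1, \ldots, x_n$, is $N(\theta \mid \mu_n, \lambda_n)$,
say, then we have

$$\bar{u}(m_1 \mid x) = -\int_{\theta > \theta_0} (\theta - \theta_0)\, N(\theta \mid \mu_n, \lambda_n)\, d\theta
= -\lambda_n^{-1/2}\, \psi_1\!\left[ \lambda_n^{1/2} (\theta_0 - \mu_n) \right],$$

where

$$\psi_1(t) = N(t \mid 0, 1) - t \int_t^{\infty} N(s \mid 0, 1)\, ds,$$

and

$$\bar{u}(m_2 \mid x) = -k \int_{\theta < \theta_0} (\theta_0 - \theta)\, N(\theta \mid \mu_n, \lambda_n)\, d\theta
= -k \lambda_n^{-1/2}\, \psi_2\!\left[ \lambda_n^{1/2} (\theta_0 - \mu_n) \right],$$

where

$$\psi_2(t) = N(t \mid 0, 1) + t \int_{-\infty}^{t} N(s \mid 0, 1)\, ds.$$

It is therefore optimal to prefer $M_1$ to $M_2$ if and only if

$$k\, \psi_2\!\left[ \lambda_n^{1/2} (\theta_0 - \mu_n) \right] > \psi_1\!\left[ \lambda_n^{1/2} (\theta_0 - \mu_n) \right].$$

In the symmetric case, $k = 1$, it is easily seen that this reduces to preferring $M_1$ if and only
if $\mu_n < \theta_0$, as one might intuitively have expected. For references and further discussion of
related topics, see DeGroot (1970, Chapter 11), and Winkler (1972, Chapter 6).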
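
The rule of Example 6.7 is straightforward to evaluate, since $\psi_1$ and $\psi_2$ involve only the standard normal density and distribution function. The Python sketch below is one way of doing so with the standard library; the function names and the numerical inputs are assumptions made for illustration.

```python
from math import erf, exp, pi, sqrt


def std_normal_pdf(t):
    return exp(-0.5 * t * t) / sqrt(2.0 * pi)


def std_normal_cdf(t):
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


def psi1(t):
    """N(t | 0, 1) - t * P(s > t): expected overshoot above the standardised threshold."""
    return std_normal_pdf(t) - t * (1.0 - std_normal_cdf(t))


def psi2(t):
    """N(t | 0, 1) + t * P(s < t): expected shortfall below the standardised threshold."""
    return std_normal_pdf(t) + t * std_normal_cdf(t)


def prefer_m1(mu_n, lam_n, theta0, k=1.0):
    """Prefer M_1 (theta < theta0) iff k * psi2(t) > psi1(t), t = sqrt(lam_n) * (theta0 - mu_n)."""
    t = sqrt(lam_n) * (theta0 - mu_n)
    return k * psi2(t) > psi1(t)


# Illustrative posterior: mean 0.6, precision 4, threshold theta0 = 0.5.
symmetric_choice = prefer_m1(0.6, 4.0, 0.5, k=1.0)
asymmetric_choice = prefer_m1(0.6, 4.0, 0.5, k=2.0)
print(symmetric_choice, asymmetric_choice)
```

Since $\psi_2(t) - \psi_1(t) = t$, the symmetric rule depends only on the sign of $\theta_0 - \mu_n$; with $k = 2$ the heavier penalty on wrongly choosing $M_2$ is enough to favour $M_1$ even though the posterior mean lies slightly above $\theta_0$.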


Prediction 

Moving away now from model comparisons which reduce to hypothesis tests in 
parametric models, let us consider the problem of model comparison or choice, 
given data x, in order to make a point prediction for a future observation y. 

The general decision structure is that given schematically in Figure 6.2, where,
assuming real-valued observables, $m_i$ corresponds to acting in accordance with
model $M_i$, $a_j$, $j \in J_i$, denotes the choice, based on $M_i$, of a prediction, $\hat{y}_i$, for a
future observation $y$, and we shall assume a “quadratic loss” utility,

$$u(m_i, \hat{y}_i, y) = -(\hat{y}_i - y)^2, \quad i \in I.$$

We recall from the analysis given in Section 6.1.3 that the optimal model choice is
$m^*$, given by

$$\bar{u}(m^* \mid x) = \sup_{i \in I} \int u(m_i, \hat{y}_i^*, y)\, p(y \mid x)\, dy,$$

where $\hat{y}_i^*$ is the optimal prediction of a future observation $y$, given data $x$ and
assuming model $M_i$; that is, the value $\hat{y}$ which minimises

$$\int (\hat{y} - y)^2\, p_i(y \mid x)\, dy,$$

where $p_i(y \mid x)$ is the predictive density for $y$ given model $M_i$. It then follows
immediately that

$$\hat{y}_i^* = \int y\, p_i(y \mid x)\, dy = E[y \mid x, M_i],$$

the predictive mean, given model $M_i$, so that

$$\int u(m_i, \hat{y}_i^*, y)\, p(y \mid x)\, dy = -\int (\hat{y}_i^* - y)^2\, p(y \mid x)\, dy, \quad i \in I.$$

Completion of the analysis now depends on the specification of the overall actual
belief distribution $p(y \mid x)$ and the computation of the expectation of $(\hat{y}_i^* - y)^2$,
$i \in I$, with respect to $p(y \mid x)$.

Again, in the $\mathcal{M}$-completed case there is nothing further to be said explicitly:
one simply carries out the necessary evaluations, using the appropriate form of
$p(y \mid x)$, by numerical integration if necessary.

In the $\mathcal{M}$-open case, the detailed analysis of the problem of point prediction
with quadratic loss will be given in Section 6.1.6.

In the $\mathcal{M}$-closed case, we have

$$p(y \mid x) = \sum_{j \in I} P(M_j \mid x)\, p_j(y \mid x),$$

and, after some rearrangement, it is easily seen that

$$\int (\hat{y}_i^* - y)^2\, p(y \mid x)\, dy = \sum_{j \in I} P(M_j \mid x) \int (\hat{y}_i^* - \hat{y}_j^* + \hat{y}_j^* - y)^2\, p_j(y \mid x)\, dy,$$

which reduces to

$$\sum_{j \in I} P(M_j \mid x)\, V[y \mid M_j, x] + \sum_{j \in I} (\hat{y}_i^* - \hat{y}_j^*)^2\, P(M_j \mid x).$$

The first term does not depend on $i$, and the second term can be rearranged in the
form

$$(\hat{y}_i^* - \hat{y}^*)^2 + \sum_{j \in I} (\hat{y}_j^* - \hat{y}^*)^2\, P(M_j \mid x),$$

where $\hat{y}^*$ is the weighted prediction

$$\hat{y}^* = \sum_{j \in I} \hat{y}_j^*\, P(M_j \mid x).$$

The preferred model $M_i$ is therefore seen to be that for which the resulting
prediction, $\hat{y}_i^*$, is closest to $\hat{y}^*$, the posterior weighted-average, over models, of the
individual model predictions. If there are only two models ($k = 2$), it is easily checked that the preferred
model is simply that with the highest posterior probability.

If we wish to make a prediction, but without first choosing a specific model,
it is easily seen that the analysis of the problem in terms of the schematic decision
problem given in Figure 6.3 of Section 6.1.2 leads directly to $\hat{y}^*$ as the optimal
prediction.

Clearly, the above analyses go through in an obvious way, with very few 
modifications, if, instead of prediction, we were to consider point estimation, with 
quadratic loss, of a parameter common to all the models. More generally, the 
analysis can be carried out for loss functions other than the quadratic. 
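
To make the $\mathcal{M}$-closed result for point prediction concrete, the following Python sketch compares two ways of choosing a model: minimising the expected quadratic loss directly, and picking the model whose predictive mean is closest to the posterior weighted prediction. The three posterior probabilities, predictive means and variances are invented for illustration, and the function names are hypothetical.

```python
def weighted_prediction(post_probs, pred_means):
    """Posterior weighted average, over models, of the predictive means."""
    return sum(p * m for p, m in zip(post_probs, pred_means))


def expected_quadratic_loss(i, post_probs, pred_means, pred_vars):
    """Expected (y_i - y)^2 under the mixture predictive distribution."""
    return sum(p * (v + (pred_means[i] - m) ** 2)
               for p, m, v in zip(post_probs, pred_means, pred_vars))


def preferred_model(post_probs, pred_means):
    """Index of the model whose prediction is closest to the weighted prediction."""
    y_star = weighted_prediction(post_probs, pred_means)
    return min(range(len(pred_means)), key=lambda i: abs(pred_means[i] - y_star))


# Illustrative (made-up) inputs for three models.
post_probs = [0.5, 0.3, 0.2]
pred_means = [0.0, 1.0, 4.0]
pred_vars = [1.0, 1.0, 1.0]

y_star = weighted_prediction(post_probs, pred_means)
losses = [expected_quadratic_loss(i, post_probs, pred_means, pred_vars)
          for i in range(3)]
best_by_loss = min(range(3), key=lambda i: losses[i])
best_by_distance = preferred_model(post_probs, pred_means)
print(y_star, losses, best_by_loss, best_by_distance)
```

In this illustration the second model is preferred although the first has the highest posterior probability, which shows that with more than two models the two criteria need not agree.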


Reporting Inferences 


Generalising beyond the specific problems of point prediction and estimation, let us 
consider the problem of model comparison or choice in order to report inferences 
about some unknown state of the world w. For example, the latter might be a 
common model parameter, a function of future observables, an indicator function 
of the future realisation of a specified event, or whatever. 

A major theme of our development in Chapters 2 and 3 has been that the
problem of reporting beliefs about $\omega$ is itself a decision problem, where the
possible answers to the inference problem consist of the class of probability
distributions for $\omega$ which are compatible with given data. The appropriate utility
functions in such problems were seen to be the score functions discussed in Sections
2.7 and 3.4. This general decision problem is thus a special case of that represented
by Figure 6.2, where, given data $x$, $m_i$ represents the choice of model $M_i$, the
subsequent answer $a_j$, $j \in J_i$, to the inference problem is some report of beliefs
about $\omega$, assuming $M_i$, and the utility function is defined by

$$u(m_i, a_j, \omega) = u_i(q_j(\cdot \mid x), \omega),$$

for some score function $u_i$, and form of belief report, $q_j(\cdot \mid x)$, about $\omega$, corresponding
to $a_j$, $j \in J_i$.

If $p_i(\cdot \mid x)$ is the form of belief report for $\omega$ actually implied by $m_i$ and if $u_i$
is a proper scoring rule (see, for example, Definition 3.16) then it follows that the
optimal $a_j$, $j \in J_i$, must be $a_i^* = p_i(\cdot \mid x)$ and that

$$u(m_i, a_i^*, \omega) = u_i(p_i(\cdot \mid x), \omega), \quad i \in I.$$

If, moreover, the score function is local (see, for example, Definition 3.18), we have
the logarithmic form

$$u(m_i, a_i^*, \omega) = A \log p_i(\omega \mid x) + B(\omega), \quad i \in I,$$

for $A > 0$ and $B(\omega)$ arbitrary, in accordance with Proposition 3.13. The expected
utility of $m_i$ is therefore given by

$$\bar{u}(m_i \mid x) = \int \left\{ A \log p_i(\omega \mid x) + B(\omega) \right\} p(\omega \mid x)\, d\omega,$$

and the preferred model is the $M_i$ for which this is maximised over $i \in I$.
Comments about the detailed implementation of the analysis in the $\mathcal{M}$-open
case are similar to those made in the previous problem.
For $\mathcal{M}$-closed models, we have the more explicit form

$$\bar{u}(m_i \mid x) = A \sum_{j \in I} P(M_j \mid x) \int p_j(\omega \mid x) \log p_i(\omega \mid x)\, d\omega + \int B(\omega)\, p(\omega \mid x)\, d\omega,$$

which, after straightforward rearrangement, shows that the preferred $M_i$ is given
by minimising, over $i \in I$,

$$\sum_{j \in I} P(M_j \mid x) \int p_j(\omega \mid x) \log \frac{p_j(\omega \mid x)}{p_i(\omega \mid x)}\, d\omega,$$

the posterior weighted-average, over models, of the logarithmic divergence (or
discrepancy) between $p_i(\omega \mid x)$ and each of $p_j(\omega \mid x)$, $j \neq i \in I$.

If, instead, we were to adopt the (proper) quadratic scoring rule (see, for
example, Definition 3.17), we obtain, ignoring irrelevant constants,

$$\bar{u}(m_i \mid x) \propto \int \left\{ 2 p_i(\omega \mid x) - \int p_i^2(\omega \mid x)\, d\omega \right\} p(\omega \mid x)\, d\omega,$$

so that, after some algebraic rearrangement, in the case of $\mathcal{M}$-closed models the
preferred $M_i$ is seen to be that which minimises, over $i \in I$,

$$\sum_{j \in I} P(M_j \mid x) \int p_j(\omega \mid x) \left[ f\{p_j(\omega \mid x)\} - f\{p_i(\omega \mid x)\} \right] d\omega,$$

where

$$f\{p_j(\omega \mid x)\} = 2 p_j(\omega \mid x) - \int p_j^2(\omega \mid x)\, d\omega.$$

Comparison of the solutions for the logarithmic and quadratic cases reveals that if,
for arbitrary $f$,

$$\delta\{q(\omega) \mid p(\omega)\} = \int p(\omega) \left[ f\{p(\omega)\} - f\{q(\omega)\} \right] d\omega$$

defines a discrepancy measure between $p$ and $q$, both may be characterised as
identifying the $M_i$ for which

$$\sum_{j \in I} P(M_j \mid x)\, \delta\{p_i(\omega \mid x) \mid p_j(\omega \mid x)\}$$

is minimised over $i \in I$, the differences in the two cases corresponding to the form
of $f$ (logarithmic or quadratic, respectively).
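
For a state of the world taking finitely many values, the equivalence between maximising the expected logarithmic score and minimising the posterior weighted divergence can be checked directly. The Python sketch below does this for two invented predictive distributions over three outcomes; all numbers and function names are assumptions made for illustration, with $A = 1$ and $B(\omega) = 0$.

```python
from math import log


def divergence(p, q):
    """Logarithmic divergence sum_k p_k log(p_k / q_k) of q from p."""
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)


def expected_log_score(i, post_probs, preds):
    """Expected log p_i(w | x) under the mixture p(w | x) = sum_j P(M_j | x) p_j(w | x)."""
    n_outcomes = len(preds[0])
    mixture = [sum(w * pj[k] for w, pj in zip(post_probs, preds))
               for k in range(n_outcomes)]
    return sum(m * log(preds[i][k]) for k, m in enumerate(mixture))


def weighted_divergence(i, post_probs, preds):
    """Posterior weighted average of the divergences of p_i from each p_j."""
    return sum(w * divergence(pj, preds[i]) for w, pj in zip(post_probs, preds))


# Illustrative (made-up) posterior model probabilities and predictive distributions.
post_probs = [0.6, 0.4]
preds = [[0.7, 0.2, 0.1],
         [0.2, 0.3, 0.5]]

scores = [expected_log_score(i, post_probs, preds) for i in range(2)]
divs = [weighted_divergence(i, post_probs, preds) for i in range(2)]
best_by_score = max(range(2), key=lambda i: scores[i])
best_by_divergence = min(range(2), key=lambda i: divs[i])
print(scores, divs, best_by_score, best_by_divergence)
```

The sum of the expected score and the weighted divergence is the same for every model, which is why the two criteria always select the same $M_i$.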

Example 6.8. (Point prediction versus predictive beliefs).

To illustrate the different potential implications of model comparison on the basis of
quadratic loss for point prediction versus model comparison on the basis of logarithmic score
for predictive belief distributions, consider the following simple ($\mathcal{M}$-closed) example.

Suppose that alternative models $M_1$, $M_2$ for $x = (x_1, \ldots, x_n)$ are defined by:

$$M_j: \; p_j(x) = \int \prod_{i=1}^{n} N(x_i \mid \mu, \lambda_j)\, N(\mu \mid \mu_0, \lambda_0)\, d\mu, \quad j = 1, 2,$$

with $\lambda_1$, $\lambda_2$, $\mu_0$, $\lambda_0$ known: we are thus assuming normal data models with precisions $\lambda_1$, $\lambda_2$,
respectively, and uncertainty about $\mu$ described by $N(\mu \mid \mu_0, \lambda_0)$ in both cases.

Now, given $x$, consider two decision problems: the first problem consists in selecting
a model and then providing a point prediction, with respect to quadratic loss, for the next
observable, $x_{n+1}$; the second problem consists in selecting a model and then providing a
predictive distribution for $x_{n+1}$, with respect to a logarithmic score function.

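For the first problem, straightforward manipulation shows that the predictive
distribution for $x_{n+1}$ assuming model $M_j$ is given by

$$p(x_{n+1} \mid M_j, x) = p_j(x_{n+1} \mid x) = N\!\left( x_{n+1} \,\Big|\, \mu_n(j), \; \frac{\lambda_j (\lambda_0 + n\lambda_j)}{\lambda_0 + (n+1)\lambda_j} \right),$$

where

$$\mu_n(j) = \frac{\lambda_0 \mu_0 + n\lambda_j \bar{x}}{\lambda_0 + n\lambda_j},$$

so that, corresponding to the analysis given earlier in this section, model $M_j$ leads to the
prediction $\mu_n(j)$, $j = 1, 2$, and the preferred model is $M_1$ if and only if

$$\frac{P(M_1 \mid x)}{P(M_2 \mid x)} > 1.$$

To identify these posterior probabilities, we note that, if $s^2 = n^{-1} \sum (x_i - \bar{x})^2$,

$$B_{12}(x) = \frac{p(x \mid M_1)}{p(x \mid M_2)} = \frac{p(\bar{x}, s^2 \mid M_1)}{p(\bar{x}, s^2 \mid M_2)}
= \frac{\lambda_1\, \chi^2_{n-1}(n\lambda_1 s^2)\, N\!\left( \bar{x} \mid \mu_0, (\lambda_0^{-1} + (n\lambda_1)^{-1})^{-1} \right)}{\lambda_2\, \chi^2_{n-1}(n\lambda_2 s^2)\, N\!\left( \bar{x} \mid \mu_0, (\lambda_0^{-1} + (n\lambda_2)^{-1})^{-1} \right)},$$

which, for small $\lambda_0$, is well approximated by

$$\left( \frac{\lambda_1}{\lambda_2} \right)^{(n-1)/2} \exp\left\{ -\tfrac{1}{2}\, n s^2 (\lambda_1 - \lambda_2) \right\}.$$

The posterior model probabilities are then given by

$$P(M_1 \mid x) = \frac{B_{12}(x)\, p_{12}}{1 + B_{12}(x)\, p_{12}}, \qquad P(M_2 \mid x) = \frac{1}{1 + B_{12}(x)\, p_{12}},$$

where $p_{12} = P(M_1)/P(M_2)$. Model $M_1$ is therefore preferred if and only if

$$\log[B_{12}(x)\, p_{12}] > 0.$$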


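In the case of equal prior weights, $p_{12} = 1$ and, assuming small $\lambda_0$, if we write the condition
in terms of the model variances $\sigma_j^2 = \lambda_j^{-1}$, $j = 1, 2$, we prefer $M_1$ when

$$\frac{\sum (x_i - \bar{x})^2}{n - 1} < \frac{\sigma_1^2 \sigma_2^2}{\sigma_2^2 - \sigma_1^2} \left[ \log \sigma_2^2 - \log \sigma_1^2 \right]$$

(taking $\sigma_1^2 < \sigma_2^2$; the inequality is reversed if $\sigma_1^2 > \sigma_2^2$).

Noting that the left-hand side is an intuitively reasonable data-based estimate of the model
variance, we see that model choice reduces to a simple cut-off rule in terms of this estimate.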

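For the second decision problem, the logarithmic divergence of $p(x_{n+1} \mid M_1, x)$ from
$p(x_{n+1} \mid M_2, x)$ is given, for small $\lambda_0$, by

$$\delta_{12} = \int N\!\left( x_{n+1} \mid \bar{x}, \lambda_1 n (n+1)^{-1} \right) \log \frac{N\!\left( x_{n+1} \mid \bar{x}, \lambda_1 n (n+1)^{-1} \right)}{N\!\left( x_{n+1} \mid \bar{x}, \lambda_2 n (n+1)^{-1} \right)}\, dx_{n+1}
= \frac{1}{2} \left\{ \log \frac{\lambda_1}{\lambda_2} + \left( \frac{\lambda_2}{\lambda_1} - 1 \right) \right\},$$

with a corresponding expression for $\delta_{21}$. The general analysis given above thus implies that
model $M_1$ is preferred if and only if

$$P(M_1 \mid x)\, \delta_{21} > P(M_2 \mid x)\, \delta_{12},$$

i.e., if and only if

$$\frac{P(M_1 \mid x)}{P(M_2 \mid x)} > \frac{\delta_{12}}{\delta_{21}}$$

(rather than $> 1$, as in the point prediction case). Note, incidentally, that should it happen
that $P(M_1 \mid x) = P(M_2 \mid x)$, model $M_1$ would be preferred if and only if $\delta_{12} < \delta_{21}$, which
happens if and only if $\lambda_1 > \lambda_2$. Intuitively, all other things being equal, we prefer in this
case the model with the smallest predictive variance.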

To obtain some limited insight into the numerical implications of these results, consider
the case where $\sigma_1^2 = 1$, $\sigma_2^2 = 25$, $n = 4$, $P(M_1) = P(M_2) = \frac{1}{2}$ and $s^2 = 3$, which gives
$B_{12} = 0.394$, so that $P(M_1 \mid x) = 0.28$, $P(M_2 \mid x) = 0.72$. Using the point prediction with
quadratic loss criterion, we therefore prefer $M_2$. However, $\delta_{12} = 1.129$ and $\delta_{21} = 10.31$, so
that if we want to choose a predictive distribution in accordance with the logarithmic score
criterion we prefer $M_1$, since $(0.28)/(0.72) > (1.129)/(10.31)$. However, if $s^2 = 4$, the
reader might like to verify that $M_2$ is preferred under both criteria ($B_{12} = 0.058$, implying
that $P(M_1 \mid x) = 0.055$).
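
The Bayes factors and posterior probabilities quoted in Example 6.8 can be reproduced from the small-$\lambda_0$ approximation. The Python sketch below does so, and also checks that the cut-off rule on the variance estimate agrees with the sign of $\log B_{12}$ when the prior weights are equal. The function names are illustrative assumptions; the sketch covers only the point-prediction part of the example.

```python
from math import exp, log


def b12_small_lambda0(n, s2, lam1, lam2):
    """Approximate Bayes factor (lam1/lam2)^((n-1)/2) exp{-n s2 (lam1 - lam2) / 2}."""
    return (lam1 / lam2) ** ((n - 1) / 2) * exp(-0.5 * n * s2 * (lam1 - lam2))


def post_prob_m1(b12, prior_odds=1.0):
    """P(M_1 | x) from the Bayes factor and the prior odds p_12 = P(M_1)/P(M_2)."""
    return b12 * prior_odds / (1.0 + b12 * prior_odds)


n, var1, var2 = 4, 1.0, 25.0
lam1, lam2 = 1.0 / var1, 1.0 / var2

b_s3 = b12_small_lambda0(n, 3.0, lam1, lam2)
b_s4 = b12_small_lambda0(n, 4.0, lam1, lam2)
p_s3, p_s4 = post_prob_m1(b_s3), post_prob_m1(b_s4)

# Cut-off rule for equal prior weights (valid here because var1 < var2).
cutoff = var1 * var2 / (var2 - var1) * (log(var2) - log(var1))
variance_estimate_s3 = n * 3.0 / (n - 1)   # sum (x_i - xbar)^2 / (n - 1)
print(round(b_s3, 3), round(p_s3, 2), round(b_s4, 3), round(p_s4, 3), round(cutoff, 3))
```

With $s^2 = 3$ the variance estimate of 4 exceeds the cut-off of about 3.35, so $M_1$ is not preferred for point prediction, in line with $B_{12} < 1$.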


6.1.6 Approximation by Cross-validation


For the general problem of model choice followed by a subsequent answer to an
inference problem, the analysis based on Figure 6.2 implies that the optimal choice
of model from $\mathcal{M}$ is the $M_i$ for which

$$\int u(m_i, a_i^*, \omega)\, p(\omega \mid x)\, d\omega$$

is maximised over $i \in I$, where $a_i^*$ denotes the optimal subsequent decision given
$M_i$. In the $\mathcal{M}$-closed case, we have seen that the mixture form of $p(\omega \mid x)$ enables
an explicit form of general solution to be exhibited; in the $\mathcal{M}$-completed case, we
have noted that the solution is in principle available, given appropriate computation.

We turn now to the case of model comparison within the $\mathcal{M}$-open framework.
What can be done to compare the values of $M_i$, $i \in I$, as proxies for an actual
belief model which itself has not been specified, so that $p(\omega \mid x)$ is not available?

We shall illustrate a possible approach to this problem by detailed consideration
of the special case where $\omega = y$, a future observation, for which a point
prediction with respect to quadratic loss, or a predictive distribution with respect
to logarithmic or quadratic score, is required.

First, we note that, in all these cases, the expected utility of the choice $M_i$,
$i \in I$, has the mathematical form

$$\int u(m_i, a_i^*, y)\, p(y \mid x)\, dy = \int f_i(y, x)\, p(y \mid x)\, dy,$$

for some function $f_i$ of $y$ and $x$, depending on $i$, whose form can be explicitly
identified. For example, for point prediction with quadratic loss, we have

$$f_i(y, x) = -\{E[y \mid M_i, x] - y\}^2;$$

for a predictive distribution with logarithmic score function we have, ignoring
irrelevant terms,

$$f_i(y, x) = \log p(y \mid M_i, x);$$

and with a quadratic score function we have

$$f_i(y, x) = 2 p(y \mid M_i, x) - \int p^2(y \mid M_i, x)\, dy.$$

Secondly, we note that there are $n$ possible partitions of $x = x_n = (x_1, \ldots, x_n)$
into $x_n = [x_{n-1}(j), x_j]$, $j = 1, \ldots, n$, where $x_{n-1}(j) = x_n - \{x_j\}$ denotes $x_n$
with $x_j$ deleted, and that, if $n$ is reasonably large, and the $x$'s are exchangeable,
each such partition effectively provides $x_{n-1}(j)$ as a “proxy” for $x$ and $x_j$ as a
“proxy” for $y$. If we now randomly select $k$ from these $n$ partitions, a standard law
of large numbers argument suggests that, as $n, k \to \infty$,

$$\left| \int f_i(y, x)\, p(y \mid x)\, dy - \frac{1}{k} \sum_{j=1}^{k} f_i(x_j, x_{n-1}(j)) \right| \to 0,$$

so that the expected utilities of $M_i$, $i \in I$, can be compared on the basis of the
quantities

$$\frac{1}{k} \sum_{j=1}^{k} f_i(x_j, x_{n-1}(j)), \quad i \in I.$$

In the case of point prediction, if $y$ is a future observation and $\hat{y}_i^*(j)$ denotes the
value of $E[y \mid M_i, x]$ when $x$ is replaced by $x_{n-1}(j)$, this approximation implies
that we minimise, over $i \in I$,

$$\frac{1}{k} \sum_{j=1}^{k} \left( \hat{y}_i^*(j) - x_j \right)^2,$$

which is an average measure, using squared distance, of how well $M_i$ performs
when, on a leave-one-out-at-a-time basis, it attempts to predict a missing part of
the data from an available subset of the data.

In the case of a predictive distribution with a logarithmic score, we maximise,
over $i \in I$,

$$\frac{1}{k} \sum_{j=1}^{k} \log p(x_j \mid M_i, x_{n-1}(j)),$$

which can be regarded as an average measure based on the logarithm of the
integrated likelihood under model $M_i$, and can be conveniently rewritten, for
computational purposes, in the form

$$\log p(x \mid M_i) - \frac{1}{k} \sum_{j=1}^{k} \log p(x_{n-1}(j) \mid M_i).$$

In the case of comparing two models, Af, M2, this criterion can be given 
an interesting reformulation. Under the logarithmic prediction distribution utility, 
and writing p;(y| a) = p(y| Adj. a), we can rearrange the criterion to see that we 
prefer model Ad, if 


Pily|@) 
Joe mly x) P| ®) dy > 0. 
where, however, in this M-open perspective, $p(y \mid x)$ is not specified. But, as we
saw above, we can form $n$ partitions $x_n = [x_{n-1}(j), x_j]$ such that
$x_{n-1}(j) = x_n - \{x_j\}$ is, for large $n$, a “proxy” for $x$ and $x_j$ is a “proxy” for $y$.
It follows that, if we randomly select $k$ of these partitions, the quantity

$$\frac{1}{k} \sum_{j=1}^{k} \log \frac{p_1(x_j \mid x_{n-1}(j))}{p_2(x_j \mid x_{n-1}(j))}$$

provides a (consistent, as $n \to \infty$) Monte Carlo estimate of the left-hand side of
the model criterion above. But this, in turn, can be rewritten so that the criterion
implies preferring model $M_1$ if

$$\prod_{j=1}^{k} \left[\frac{p_1(x_j \mid x_{n-1}(j))}{p_2(x_j \mid x_{n-1}(j))}\right]^{1/k} = \prod_{j=1}^{k} \left[B_{12}(x_j, x_{n-1}(j))\right]^{1/k} > 1,$$

where, for $j = 1, \ldots, k$, $B_{12}(x_j, x_{n-1}(j))$ denotes the Bayes factor for $M_1$ against
$M_2$, based on the versions

$$\{p_i(x_j \mid \theta_i),\ p_i(\theta_i \mid x_{n-1}(j))\}$$



of $M_1$, $M_2$. We recall from Section 6.1.4 the role of Bayes factors based on the
versions $\{p_i(x \mid \theta_i), p_i(\theta_i)\}$ of $M_1$, $M_2$, in the context of zero-one loss functions and
the M-closed perspective. Although there are clear differences here in formulation
(M-open versus M-closed, log-predictive utility versus 0-1 loss), it is interesting
to note the role played again by the Bayes factor. One interesting difference is the
following (Pericchi, 1993). In Section 6.1.4, the Bayes factor is evaluating $M_1$, $M_2$
on the basis of the models’ ability to “predict” $x$, given no data (beyond what has
been used to specify $p_i(\theta_i)$). In contrast, in the above we are taking a geometric
average of Bayes factors which are evaluating $M_1$, $M_2$ on the basis of the models’
ability to predict one further observable, given $n - 1$ observations. The former
situation puts the emphasis on “fidelity to the observed data”; the latter puts the
emphasis on “future predictive power”.
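
The geometric average of leave-one-out Bayes factors can be sketched in Python as follows. The two models used here are illustrative assumptions: normal data with known precision `lam`, with $M_1$ fixing the mean at `mu0` and $M_2$ placing a normal prior (mean `mu1`, precision `lam1`) on the mean. The function names are hypothetical. The code returns the average of $\log B_{12}(x_j, x_{n-1}(j))$ over all $n$ partitions, so that a positive value corresponds to a geometric mean greater than 1.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def mean_log_loo_bayes_factor(x, mu0, lam, mu1, lam1):
    """Average over j of log B_12(x_j, x_{n-1}(j)), using all n partitions.

    M_1: x_j ~ N(mu0, precision lam), nothing to update.
    M_2: x_j ~ N(mu, precision lam) with prior mu ~ N(mu1, precision lam1),
         updated on the n - 1 retained observations before predicting x_j.
    """
    n = len(x)
    total = 0.0
    for j in range(n):
        rest = x[:j] + x[j + 1:]
        xbar_rest = sum(rest) / (n - 1)
        post_prec = lam1 + (n - 1) * lam
        post_mean = (lam1 * mu1 + (n - 1) * lam * xbar_rest) / post_prec
        log_p1 = normal_logpdf(x[j], mu0, 1.0 / lam)
        # Predictive variance under M_2: sampling variance plus posterior variance.
        log_p2 = normal_logpdf(x[j], post_mean, 1.0 / lam + 1.0 / post_prec)
        total += log_p1 - log_p2
    return total / n


near_null = mean_log_loo_bayes_factor([0.0, 0.0, 0.0, 0.0], 0.0, 1.0, 0.0, 1.0)
far_from_null = mean_log_loo_bayes_factor([5.0, 5.2, 4.8, 5.1], 0.0, 1.0, 0.0, 0.01)
```

In this sketch, data concentrated at `mu0` give a positive average (favouring $M_1$), while data far from `mu0` give a negative one (favouring $M_2$).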

These kinds of approximate performance measurements for comparing models 
could obviously be generalised by considering random partitions of $x$ involving
leave-several-out-at-a-time techniques. We shall not develop such ideas further 
here—apart from giving one further interesting illustration in Section 6.3.3—but 
merely note that the above approximation to the optimal Bayesian procedure leads 
naturally to a cross-validation process, which results in a preference for models 
under which the data achieve the highest levels of “internal consistency”. Thus,
for example, in both the quadratic loss and logarithmic score cases, if under model
$M_i$ there are $x_j$ which are “surprising” in the light of $x_{n-1}(j)$, thus leading to large
squared distance terms or small log-integrated-likelihood values, respectively, the
performance measure will penalise $M_i$.

Model choice and estimation procedures involving cross-validation (some- 
times called predictive sample reuse) have been proposed by several authors, from 
a mainly non-Bayesian perspective, as a pragmatic device for generating statistical 
methods without seeming to invoke a “true” sampling model: see, for example, 
Stone (1974) and Geisser (1975) for early accounts and Shao (1993) for a recent 
perspective. The above development clearly establishes that such cross-validatory
techniques do indeed have an interesting role in a Bayesian decision-theoretic
setting for approximating expected utilities in decision problems where a set of
alternative models are to be compared without wishing to act as if one of the models
were “true”, and in the absence of a specified actual belief model.


Example 6.9. (Lindley’s paradox revisited). 

In Example 6.6, we considered the case of two alternative models, $M_1$, $M_2$, for
$x = (x_1, \ldots, x_n)$, corresponding to simple and composite hypotheses about $\mu$ in
$N(x_i \mid \mu, \lambda)$ and defined by:

$$M_1: \quad p_1(x) = \prod_{i=1}^{n} N(x_i \mid \mu_0, \lambda), \qquad \mu_0, \lambda \text{ known};$$

$$M_2: \quad p_2(x) = \int \prod_{i=1}^{n} N(x_i \mid \mu, \lambda)\, N(\mu \mid \mu_1, \lambda_1)\, d\mu, \qquad \mu_1, \lambda_1, \lambda \text{ known}.$$

The analysis given in Example 6.6 was within the M-closed context with $P(M_i) > 0$,
$i = 1, 2$, and it was shown that, as $\lambda_1 \to 0$, $P(M_1 \mid x) \to 1$ for any fixed $x$. It follows from
results given in Sections 6.1.3 and 6.1.4 that, as $\lambda_1 \to 0$, $M_1$ would be the preferred model
under either zero-one utility, or quadratic loss utility for point prediction (since in this latter
case, the criterion reduces to the comparison of posterior probabilities when just two models
are being compared).

We shall now reconsider the case of quadratic loss for point prediction in the M-open
context.

First, we note that, given $x$, the optimal prediction of a future observation, $y$, under
$M_1$ is just $\hat{y}_1 = \mu_0$, whereas (making appropriate notational changes to the results given in
Example 5.10) under $M_2$ the optimal prediction is

$$\hat{y}_2 = \mu_n = \frac{\lambda_1 \mu_1 + n\lambda \bar{x}}{\lambda_1 + n\lambda} = (1 - w_n)\mu_1 + w_n \bar{x},$$

where $w_n = n\lambda(\lambda_1 + n\lambda)^{-1}$. Secondly, from the cross-validation approximation analysis
given above we see that $M_1$ is preferred to $M_2$ if and only if, based on $k$ random partitions
of $x$ into $x_j$ and $x_{n-1}(j)$,


$$\sum_{j=1}^{k} (x_j - \mu_0)^2 < \sum_{j=1}^{k} \left\{x_j - \left[(1 - w_{n-1})\mu_1 + w_{n-1}\,\bar{x}_{n-1}(j)\right]\right\}^2,$$

where $w_{n-1} = (n-1)\lambda[\lambda_1 + (n-1)\lambda]^{-1}$ and $\bar{x}_{n-1}(j) = \bar{x} + (n-1)^{-1}(\bar{x} - x_j)$ is the mean
of the sample $x$ with $x_j$ omitted.
Intuitively, $M_2$ will be preferred if the posterior mean on average does better as a predictor
than $\mu_0$. In particular, if $\lambda_1 \to 0$, and $k = n$, an approximate analysis shows that $M_1$ is
preferred to $M_2$ if and only if

$$\sum_{j=1}^{n} (x_j - \mu_0)^2 < \sum_{j=1}^{n} \left(x_j - \bar{x}_{n-1}(j)\right)^2.$$

Using $x_j - \bar{x}_{n-1}(j) = n(x_j - \bar{x})/(n-1)$, this is easily seen (Pericchi, 1993) to be
equivalent to preferring $M_1$ if, and only if,

$$\frac{n(\mu_0 - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2/(n-1)} < \frac{2n-1}{n-1} \approx 2,$$

which, under $M_1$, is equivalent to rejecting $M_1$ if a Snedecor $F_{1,n-1}$ random quantity exceeds
the value 2. See Leonard and Ord (1976) for a related argument.

This result provides a marked contrast to that obtained in Example 6.6 and makes 
clear that, even given the same data and utility criterion, preferences among models in M 
may differ radically, depending on whether one is approaching their comparison from an 
M-closed or M-open perspective. 
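
The equivalence used in Example 6.9, between the direct leave-one-out comparison (with $k = n$ and a vague prior) and the $F$-type ratio criterion, can be checked numerically. The following Python sketch uses hypothetical function names and simulated data, both assumptions made purely for illustration.

```python
import random


def prefers_m1_by_cross_validation(x, mu0):
    """Direct leave-one-out comparison with k = n and lambda_1 -> 0."""
    n = len(x)
    total = sum(x)
    err_m1 = sum((xj - mu0) ** 2 for xj in x)
    # Under M_2 the prediction for x_j is the mean of the other n - 1 values.
    err_m2 = sum((xj - (total - xj) / (n - 1)) ** 2 for xj in x)
    return err_m1 < err_m2


def f_ratio(x, mu0):
    """n (mu0 - xbar)^2 divided by sum (x_j - xbar)^2 / (n - 1)."""
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((xj - xbar) ** 2 for xj in x) / (n - 1)
    return n * (mu0 - xbar) ** 2 / s2


def prefers_m1_by_f_ratio(x, mu0):
    """Equivalent criterion: the ratio is below (2n - 1)/(n - 1), roughly 2."""
    n = len(x)
    return f_ratio(x, mu0) < (2 * n - 1) / (n - 1)


# Compare the two criteria on simulated samples of size 8.
rng = random.Random(1)
agreements = []
for _ in range(200):
    x = [rng.gauss(0.0, 1.0) for _ in range(8)]
    mu0 = rng.uniform(-2.0, 2.0)
    agreements.append(
        prefers_m1_by_cross_validation(x, mu0) == prefers_m1_by_f_ratio(x, mu0)
    )
```

Both functions encode the same preference, so the entries of `agreements` should all be true apart from possible floating-point ties at the boundary.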


6.1.7 Covariate Selection 


We have already had occasion to remark several times that our emphasis in this 
volume is on concepts and theory, and that complex case-studies and associated 
computation will be more appropriately discussed in subsequent volumes. That 
said, it might be illuminating at this juncture to indicate briefly how the theory we 
have developed above can be applied in contexts which are much more complicated 
than those of the simple, stylised examples on which most of our discussion has been 
based. To this end, we shall consider the important problem of model comparison 
which arises when we try to identify appropriate covariates for use in practical 
prediction and classification problems. 

To fix ideas, consider the following problem. Some kind of decision is to 
be made regarding an unknown state of the world w relating to an individual unit 
of a population: for example, classifying the, as yet unknown, disease state of a 
specific patient, or predicting the, as yet unknown, quality level of the output from 
a particular run of an industrial production process. Possible predictive models 
are to be based, for various choices of $m$, on covariates $y_1(z), \ldots, y_m(z)$, which
are themselves selected functions of $z = (z_1, \ldots, z_s)$, representing all possible
observed relevant attributes (discrete or continuous) for the individual population 
unit: for example, the patient’s complete recorded clinical history, or a record of 
all the input and control parameters of the production run. To aid the modelling 
process, a data bank (of “training data”) is available consisting of 


$$D = \{(w_j, z_{j1}, \ldots, z_{js}),\ j = 1, \ldots, n\},$$


recording all the attributes and (eventually known) states of the world for $n$ previously
observed units of the same population: for example, $n$ previous patients
presenting at the same clinic, or $n$ previous runs of the same production process.
We shall suppose that the ultimate objective is to provide, for the state of
the world $w$ of the new individual unit, a predictive distribution $p(w \mid y(z), D)$,
where $y$ denotes a generic element of the set of all possible $\{y_i(\cdot), i = 1, \ldots, m\}$
under consideration for defining covariates. If w is discrete, we typically refer to 
the problem as one of classification; if w is continuous, we refer to it as one of 
prediction. 

To simplify the exposition, we shall suppose that identification of the density
$p(\cdot \mid y(z), D)$ is equivalent to the identification of $y \in Y$, where $Y$ denotes the
class of all $y$ under consideration. The particular forms in $Y$ will depend, of course,
on the practical problem under consideration; typically, however, it will include
functions mapping $z$ to $z_i$, $i = 1, \ldots, s$, so that individual attributes themselves
are also eligible to be chosen as covariates. Then, if $u\{p(\cdot \mid y(z), D), w\}$ denotes a
utility function for using the predictive form $p(\cdot \mid y(z), D)$ when $w$ turns out to be
the true state of the world, the resulting decision problem is shown schematically
in Figure 6.4.


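[Decision tree: choice of $y \in Y$, followed by the uncertain outcome $(z, w)$, with utility $u\{p(\cdot \mid y(z), D), w\}$]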


Figure 6.4 Selection of covariates as a decision problem 


If $p(z, w \mid D)$ represents the predictive distribution for $(z, w)$, given the “training
data” $D$, the different possible models corresponding to the different possible
choices of covariates, $y \in Y$, are then compared on the basis of their expected
utilities

$$\bar{u}(y \mid D) = \int u\{p(\cdot \mid y(z), D), w\}\, p(z, w \mid D)\, dz\, dw, \quad y \in Y.$$


The resulting optimal choice will, of course, depend on the form of the utility 
function. Typically, the latter will not only incorporate a score function component 
for assessing $p(\cdot \mid y(z), D)$, but possibly also a cost component, reflecting the
different costs associated with the use of different covariates y. For example, in the 
case of disease classification the use of fewer covariates could well mean cheaper 
and quicker diagnoses: in the case of predicting production quality the use of fewer 
covariates could cut costs by requiring less on-line measurement and monitoring. If 
we suppose, for simplicity, that the utility function can be decomposed into additive
score and cost components,

$$s\{p(\cdot \mid y(z), D), w\} - c\{y(z)\},$$

the expected utility of the choice $y$ is given by

$$\bar{u}(y \mid D) = \int s\{p(\cdot \mid y(z), D), w\}\, p(z, w \mid D)\, dz\, dw - \int c\{y(z)\}\, p(z \mid D)\, dz.$$


In many cases, it will be natural to use proper score functions, for example, quadratic
or logarithmic. If costs are omitted, the optimal model will typically involve a large
number of covariates; if cost functions are used which increase with the number of
covariates in the model, a small subset of the latter will typically be optimal. More
pragmatically, one could ignore costs, identify the optimal $y^*_{(i)}$, $i = 1, 2, \ldots$, over
all possible choices of one covariate, two covariates, etc., observe that

$$\bar{u}(y^*_{(1)} \mid D) \le \bar{u}(y^*_{(2)} \mid D) \le \cdots \le \bar{u}(y^*_{(i)} \mid D) \le \cdots$$

is typically concave, reflecting diminishing marginal expected utility for the incorporation of
further covariates, and hence select that $y^*_{(i)}$ for which $\bar{u}(y^*_{(i+1)} \mid D) - \bar{u}(y^*_{(i)} \mid D)$
is less than some appropriately predefined small constant.
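
The pragmatic stopping rule just described can be sketched in a few lines of Python. The function name, the threshold `eps` and the example utilities are illustrative assumptions; the input is taken to be the sequence of best expected utilities already computed for one covariate, two covariates, and so on.

```python
def select_number_of_covariates(best_utilities, eps):
    """Return the smallest i such that adding an (i+1)-th covariate gains less than eps.

    best_utilities[i - 1] is assumed to hold the expected utility of the best
    choice of i covariates, so the sequence is non-decreasing in i.
    """
    for i in range(1, len(best_utilities)):
        gain = best_utilities[i] - best_utilities[i - 1]
        if gain < eps:
            return i
    # No gain fell below eps: keep the largest model considered.
    return len(best_utilities)


# Hypothetical expected utilities (for example, average log scores).
utilities = [-2.0, -1.2, -1.0, -0.95, -0.94]
chosen = select_number_of_covariates(utilities, eps=0.1)
```

With these hypothetical values the gains are roughly 0.8, 0.2, 0.05 and 0.01, so the rule stops at three covariates. The choice of `eps` plays the role of the "appropriately predefined small constant" and would have to be set in the context of the application.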

Given the complexity of problems of this type, the set Y = M of possible 
models is typically a rather pragmatically defined collection of stylised forms, 
and, recalling the discussion of Section 6.1.2, an M-closed perspective would 
not usually be appropriate. In fact, in most applications p(z,w | D) is likely to 
prove far too complicated for any honest representation, so that, in the terminology 
of Section 6.1.2, we need to perform a comparison of the models in $Y$ from the
M-open perspective. There are interesting open problems in the development of
the cross-validation techniques that might be employed in particular cases, but
discussion of these would take us far into the realm of methods and case-studies, 
and so will be deferred to the second and third volumes of this work. 


6.2 MODEL REJECTION 
6.2.1 Model Rejection through Model Comparison 


In the previous section, we considered model comparison problems arising from the
existence of a proposed range of well-defined possible models, $M = \{M_i, i \in I\}$,
for observations $x$, where the primary decision consisted in choosing $m_i$, $i \in I$,
with the implication of subsequently acting as if the corresponding $M_i$, $i \in I$, were
the predictive model.

In this section, we shall be concerned with the situation which arises when
just one specific well-defined model for $x$, $M_0$ say, has been proposed initially, and
the primary decision corresponds either to the choice $m_0$, which corresponds to
subsequently acting as if $M_0$ were the predictive model, or to the choice $m_0^c$ (thus,
in a sense, rejecting $M_0$), with the implication of “doing something else”. If, given
$x$, $\bar{u}(\cdot \mid x)$ denotes the ultimate expected utility of a primary action, this model
rejection problem might be represented schematically by Figure 6.5.

[Decision tree: choice between $m_0$, with expected utility $\bar{u}(m_0 \mid x)$, and $m_0^c$, with expected utility $\bar{u}(m_0^c \mid x)$, as yet undefined]

Figure 6.5 Model rejection as a decision problem

Such a structure may arise, for example, as a consequence of $M_0$ being the
only predictive model thus far put forward in a specific decision problem context;
or, as a consequence of the application of some kind of principle of simplicity
or parsimony, as an attempt to “get away with” using $M_0$, instead of using more
complicated (but, in this context, unstated) alternatives.

What perspectives might one adopt in relation to this, thus far clearly ill- 
defined, problem of model rejection? 

If we are concerned with coherent action in the context of a well-posed decision
problem, we see from Figure 6.5 that we cannot proceed further unless we have
some method for arriving at a value of $\bar{u}(m_0^c \mid x)$ to compare with $\bar{u}(m_0 \mid x)$. One
way or another, we are forced to consider alternative models to $M_0$.

Let us suppose therefore that we have embedded $M_0$ in some larger class
of models $M = \{M_i, i \in I\}$. This might be done, particularly where $M_0$ has
been put forward for reasons of simplicity or parsimony, by consideration of actual
alternatives to $M_0$ thought (by someone) to be of practical interest. Otherwise, it
might be done by consideration of formal alternatives, generated by selecting, in
some way, a “mathematical neighbourhood” of $M_0$ (which might also, of course,
contain alternatives of practical interest). For this redefined problem of model
rejection within $M$, shown schematically in Figure 6.6, the hitherto undefined
value of $\bar{u}(m_0^c \mid x)$ becomes

$$\bar{u}(m_0^c \mid x) = \max_{i \in I'} \bar{u}(m_i \mid x),$$

where $I' = I - \{0\}$ indexes the models in $M$ distinct from $M_0$.

For any specific decision problem, the calculation of $\bar{u}(m_i \mid x)$, $i \in I$, proceeds
as indicated in Section 6.1. Thus, if we adopt the M-closed perspective, evaluations
are based on mixture forms involving prior and posterior probabilities of the $M_i$,
$i \in I$; if we adopt the M-completed perspective, the calculation is, in principle,
well-defined, but may be numerically involved; if we adopt the M-open perspective,
we may use a cross-validation procedure to estimate the expected utilities.

[Decision tree: choice between $m_0$, with expected utility $\bar{u}(m_0 \mid x)$, and the alternatives $m_i$, $i \in I'$, with expected utilities $\bar{u}(m_i \mid x)$]

Figure 6.6 Model rejection within $M = \{M_i, i \in I\}$


In the case where $\{M_i, i \in I'\}$ consists of actual alternatives to $M_0$, we might
regard the redefined model rejection problem as essentially identical to the model
comparison problem, so that rejecting $m_0$ corresponds to choosing the best of $m_i$,
$i \in I'$. However, this would seem to ignore the fact that when $M_0$ has been put
forward for reasons of simplicity or parsimony there is an implicit assumption that
the latter has some “extra utility”, over and above the expected utility $\bar{u}(m_0 \mid x)$.
Thus, if $\bar{u}(m_0^c \mid x) - \bar{u}(m_0 \mid x)$ were positive, but not “too large”, we might still
prefer $M_0$ because of the special “simple” status of $M_0$. The same argument
applies even more forcibly in the case where $\{M_i, i \in I'\}$ consists of formal
alternatives to $M_0$, since rejecting $M_0$ may not lead obviously to an actual alternative
model, and the “extra utility” of choosing $M_0$ if at all possible may be greater.
From this perspective, the redefinition of the problem of model rejection as one of
model comparison corresponds to modifying slightly the representation given in
Figure 6.6, by replacing $\bar{u}(m_0 \mid x)$ by $\bar{u}(m_0 \mid x) + \varepsilon_0(x)$, where $\varepsilon_0(x)$ represents,
given $x$, an implicit (but as yet undefined) extra utility relating to the special status
of $M_0$. (See Dickey and Kadane, 1980, for related discussion.)
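
As a minimal illustration of this modified comparison, the following Python sketch rejects $M_0$ only when the best alternative beats it by more than the extra utility. The function name, the dictionary representation and the numerical values are assumptions made for illustration; the expected utilities themselves would have to come from one of the analyses of Section 6.1.

```python
def reject_m0(expected_utilities, epsilon0):
    """Decide whether to reject M_0 in favour of the best alternative.

    expected_utilities : dict mapping model index i to the expected utility
                         of choosing m_i; index 0 is the simple model M_0
    epsilon0           : extra utility attached to keeping M_0
    """
    u0 = expected_utilities[0]
    # Expected utility of rejecting M_0: the best of the alternatives.
    best_alternative = max(u for i, u in expected_utilities.items() if i != 0)
    return best_alternative > u0 + epsilon0


utilities = {0: 1.00, 1: 1.05, 2: 0.90}   # hypothetical values
keep_with_large_premium = not reject_m0(utilities, epsilon0=0.10)
reject_with_small_premium = reject_m0(utilities, epsilon0=0.01)
```

With these hypothetical values, an alternative that is only slightly better than $M_0$ leads to rejection when the premium is small but not when it is large.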

The formulation of the model rejection problem given above is rather too
general to develop further in any detail. In order, therefore, to provide concrete
illustrative analyses, we shall assume, for the remainder of this chapter, that
$M = \{M_0, M_1\}$, where, for some parametric family $\{p(\cdot \mid \theta), \theta \in \Theta\}$, predictive models
for $x$ are defined, for some $\Theta_0 \subset \Theta$, by:

$$M_0: \quad p_0(x) = \int_{\Theta_0} p(x \mid \theta)\, p_0(\theta)\, d\theta,$$

$$M_1: \quad p_1(x) = \int_{\Theta} p(x \mid \theta)\, p_1(\theta)\, d\theta.$$

Specifically, $M_0$ will correspond to either:
(i) $p(x \mid \theta_0)$, a simple hypothesis that $\theta = \theta_0$ (specified by a degenerate prior
$p_0(\theta)$ which concentrates on $\theta_0$), or
(ii) $p(x \mid \phi_0, \lambda)$, a simple hypothesis on a parameter of interest $\phi$ where $\lambda$ is a
nuisance parameter (specified by a prior $p_0(\theta) = p(\lambda \mid \phi_0)$ which concentrates
on the subspace defined by $\phi = \phi_0$).

The next three sections consider some detailed model rejection procedures
within this parametric framework.


6.2.2 Discrepancy Measures for Model Rejection 


Within the parametric framework described at the end of the previous section,
model $M_0$ corresponds to a form of parametric restriction (or “null hypothesis”)
imposed on model $M_1$. In such situations, it is common practice to consider the
decision problem of model rejection to be that of assessing the compatibility of
model $M_0$ with the data $x$, this being calibrated relative to the wider model $M_1$
within which $M_0$ is embedded. We shall focus on this version of the model rejection
problem with utilities defined by $u(m_0, \theta)$, $u(m_1, \theta)$, $\theta \in \Theta$, and overall beliefs,
$p(\theta \mid x)$, $\theta \in \Theta$, defined either by an M-closed form,

$$P(M_0 \mid x)\, p(\theta \mid M_0, x) + P(M_1 \mid x)\, p(\theta \mid M_1, x),$$

with $M = \{M_0, M_1\}$, or by the $\{M_1\}$-closed form, $p(\theta \mid M_1, x)$, the latter providing
a kind of “adversarial” analysis, since it assigns $M_0$ no special status.
Noting that there are only two alternatives in this decision problem, it suffices
to specify the (conditional) difference in utilities, say in favour of the larger model
$M_1$,

$$\delta(\theta) = u(m_1, \theta) - u(m_0, \theta),$$

since the optimal inference will clearly be to reject model $M_0$ if and only if

$$\int \{u(m_1, \theta) - u(m_0, \theta)\}\, p(\theta \mid x)\, d\theta > \varepsilon_0(x),$$


where $\varepsilon_0(x)$ represents, as before, the utility premium attached to keeping the
simpler model $M_0$. We shall refer to $\delta(\theta)$ as a (utility-based) discrepancy measure
between $M_0$ and $M_1$ when $\theta \in \Theta$ is the true parameter. In terms of the discrepancy,
the optimal action is to reject model $M_0$ if and only if

$$t(x) > \varepsilon_0(x),$$

where

$$t(x) = \int \delta(\theta)\, p(\theta \mid x)\, d\theta.$$


With a considerable reinterpretation of conventional statistical terminology,
we might refer to $t(\cdot)$ as a test statistic, leading, for given data $x$, to the rejection
of model $M_0$ if the observed value of the test statistic exceeds a critical value,
$c(x) = \varepsilon_0(x)$.

How might $c(x)$ be chosen? One possible approach could be to consider, prior
to observing $x$ and assuming $M_0$ to be true, a choice of $c(\cdot)$ which would lead $M_0$
to only be rejected with low probability ($\alpha$, say, for values of $\alpha$ of the order of 0.05
or 0.01). Under this approach, we would choose $c(\cdot)$ such that

$$P(t(x) > c(x) \mid M_0) = \alpha,$$

thus obtaining the $100(1 - \alpha)\%$ point of the predictive distribution of $t(x)$,
conditional on information available prior to observing $x$ and assuming $M_0$ to be
true. Of course, this is just one possible approach to selecting $c(\cdot)$ and has no
special theoretical significance. It is interesting that this choice turns out to lead
to commonly used procedures for model rejection which have typically not been
derived or justified previously from the perspective of a decision problem (see,
for example, Box, 1980, for a non-decision-theoretic approach). Examples will
be given in the following two sections. However, for criticism of the practice of
working with a fixed $\alpha$, see Appendix B, Section 3.3.
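
One way to make this calibration concrete is by simulation: draw data under $M_0$, evaluate the test statistic, and take an empirical quantile. The Python sketch below does this for an assumed null model (ten observations from a standard normal distribution) and an assumed statistic of the form $(1 + z^2)/2$ with $z = \sqrt{n}\,\bar{x}$; the function names, the null model and the number of simulations are all illustrative choices, not part of the original development.

```python
import math
import random


def calibrate_critical_value(simulate_under_m0, statistic, alpha,
                             n_sim=20000, seed=0):
    """Empirical 100(1 - alpha)% point of statistic(x) for x simulated under M_0."""
    rng = random.Random(seed)
    values = sorted(statistic(simulate_under_m0(rng)) for _ in range(n_sim))
    index = min(n_sim - 1, math.ceil((1.0 - alpha) * n_sim) - 1)
    return values[index]


def simulate(rng, n=10):
    """Assumed null model: n independent N(0, 1) observations."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def t_stat(x):
    """Assumed statistic: (1 + z^2) / 2 with z = sqrt(n) * sample mean."""
    z = math.sqrt(len(x)) * sum(x) / len(x)
    return 0.5 * (1.0 + z * z)


c = calibrate_critical_value(simulate, t_stat, alpha=0.05)
```

For this particular statistic the critical value could also be obtained analytically from the chi-squared distribution with one degree of freedom; the simulation route is shown only because it applies unchanged to statistics without a closed-form null distribution.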


6.2.3 Zero-one Discrepancies 


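Suppose that the discrepancy measure introduced in Section 6.2.2 is defined to be

$$\delta(\theta) = \begin{cases} 0 & \text{if } \theta = \theta_0 \\ 1 & \text{if } \theta \ne \theta_0. \end{cases}$$

Assuming the decision problem of model rejection to be defined from the M-closed
perspective, we obtain

$$t(x) = \int \delta(\theta)\, p(\theta \mid x)\, d\theta = P(M_1 \mid x) = \frac{P(M_1)\, p(x \mid M_1)}{P(M_0)\, p(x \mid M_0) + P(M_1)\, p(x \mid M_1)},$$

where

$$p(x \mid M_0) = \begin{cases} p(x \mid \theta_0) & \text{if } M_0 \text{ specifies a simple hypothesis} \\ \int p(x \mid \phi_0, \lambda)\, p(\lambda \mid \phi_0)\, d\lambda & \text{if } \theta = (\phi_0, \lambda), \end{cases}$$

and

$$p(x \mid M_1) = \int p(x \mid \theta)\, p_1(\theta)\, d\theta.$$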


It follows from the analysis given in the previous section that $M_0$ should be
rejected if and only if, for specified critical value $c(x)$,

$$P(M_1 \mid x) > c(x),$$

i.e., if the posterior odds on $M_0$ and against $M_1$ are smaller than $(1 - c(x))/c(x)$.
In the case where the prior odds are equal, the rejection criterion for $M_0$ in terms
of the Bayes factor is given by

$$B_{01}(x) < \frac{1 - c(x)}{c(x)}.$$


Example 6.10. (Null hypothesis for a binomial parameter). 
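Suppose that $x$ represents $x$ successes in $n$ binomial trials, with $M_0$, $M_1$ defined by

$$M_0: \quad p_0(x) = p_0(x \mid n) = \text{Bi}(x \mid n, \theta_0),$$

$$M_1: \quad p_1(x) = p_1(x \mid n) = \int_0^1 \text{Bi}(x \mid n, \theta)\, \text{Un}(\theta \mid 0, 1)\, d\theta,$$

and $P(M_0) = P(M_1) = \frac{1}{2}$. Straightforward manipulation shows that

$$B_{01}(x) = \frac{\Gamma(n+2)}{\Gamma(x+1)\,\Gamma(n-x+1)}\, \theta_0^{x} (1-\theta_0)^{n-x},$$

which, assuming, for purposes of illustration, large $x$, $n - x$, and applying Stirling's formula,
$\log \Gamma(n+1) \approx (n + \frac{1}{2})\log(n) - n + \frac{1}{2}\log(2\pi) + (12n)^{-1}$, can be approximated by

$$B_{01}(x) \approx \left[\frac{n}{2\pi\theta_0(1-\theta_0)}\right]^{1/2} \exp\left\{-\tfrac{1}{2}\chi^2(x)\right\},$$

where

$$\chi^2(x) = \frac{(x - n\theta_0)^2}{n\theta_0} + \frac{[(n - x) - n(1-\theta_0)]^2}{n(1-\theta_0)}$$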

is the usual chi-squared test statistic. By considering $-2 \log B_{01}(x)$, for given $n$, $\theta_0$, a value
of $c(x) = c(\alpha, n)$ which calibrates the procedure to only reject $M_0$ with probability $\alpha$ when
$M_0$ is true, is defined by the equation

$$\left[\frac{n}{2\pi\theta_0(1-\theta_0)}\right]^{1/2} \left(\frac{c(\alpha, n)}{1 - c(\alpha, n)}\right) = \exp\left\{\tfrac{1}{2}\chi^2_{1,\alpha}\right\},$$

where $\chi^2_{1,\alpha}$ denotes the upper $100\alpha\%$ point of a $\chi^2_1$ distribution. Of course, having decided
on this particular approach to the choice of $c(x)$, there is no real need to identify it! The
rejection procedure is simply defined by comparing the test statistic value, $\chi^2(x)$, with its
tail critical value, $\chi^2_{1,\alpha}$. The reader might like to verify (perhaps with the aid of Jeffreys,
1939/1961, Chapter 5) that similar results can be obtained for a variety of “null models” in
more general contingency tables.
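
The quality of the approximation in Example 6.10 can be examined numerically. The Python sketch below (function names and the numerical inputs are illustrative assumptions) evaluates the exact Bayes factor on the log scale through `math.lgamma` and compares it with the chi-squared approximation.

```python
import math


def log_bayes_factor_exact(x, n, theta0):
    """log B_01(x) for a point null theta0 against a uniform prior on theta."""
    return (math.lgamma(n + 2) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(theta0) + (n - x) * math.log(1.0 - theta0))


def chi_squared(x, n, theta0):
    """Usual chi-squared statistic for x successes in n trials under theta0."""
    return ((x - n * theta0) ** 2 / (n * theta0)
            + ((n - x) - n * (1.0 - theta0)) ** 2 / (n * (1.0 - theta0)))


def log_bayes_factor_approx(x, n, theta0):
    """Large-sample approximation to log B_01(x) based on the chi-squared statistic."""
    scale = n / (2.0 * math.pi * theta0 * (1.0 - theta0))
    return 0.5 * math.log(scale) - 0.5 * chi_squared(x, n, theta0)


# Hypothetical data: 520 successes in 1000 trials, null value 0.5.
exact = log_bayes_factor_exact(520, 1000, 0.5)
approx = log_bayes_factor_approx(520, 1000, 0.5)
```

For these hypothetical counts the two log Bayes factors agree closely; the agreement would be expected to deteriorate when $x$ or $n - x$ is small, or when $x/n$ is far from $\theta_0$.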
6.2.4 General Discrepancies


Given our systematic use throughout this volume of the logarithmic divergence
between two distributions, an obviously appealing form of discrepancy measure is
given by

$$\delta(\theta) = \int p(x \mid \theta) \log \frac{p(x \mid \theta)}{p(x \mid \theta_0)}\, dx,$$

where $p(x \mid \theta)$ is the general parametric form of model $M_1$.
In the case of a location-scale model, $\theta_0 = (\mu_0, \sigma)$, $\theta = (\mu, \sigma)$, we might
consider the standardised measure

$$\delta(\theta) = \left(\frac{\mu - \mu_0}{\sigma}\right)^2.$$

In any case, the general prescription will be to reject $M_0$ if $t(x) > c(x)$, for some
appropriate $c(x)$, where

$$t(x) = \int \delta(\theta)\, p(\theta \mid x)\, d\theta,$$

with $p(\theta \mid x)$ derived either from an M-closed model, as illustrated in Section 6.2.3,
or from the “adversarial” form corresponding to assuming model $M_1$.


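Example 6.11. (Lindley’s paradox revisited, again). In Examples 6.6 and 6.9 we
considered the use of models $M_1$, $M_2$ defined, for $x = (x_1, \ldots, x_n)$, by

$$M_1: \quad p_1(x) = \prod_{i=1}^{n} N(x_i \mid \mu_0, \lambda), \qquad \mu_0, \lambda \text{ known},$$

$$M_2: \quad p_2(x) = \int \prod_{i=1}^{n} N(x_i \mid \mu, \lambda)\, N(\mu \mid \mu_1, \lambda_1)\, d\mu, \qquad \mu_1, \lambda_1, \lambda \text{ known}.$$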


Using the logarithmic divergence discrepancy, we obtain

$$\delta(\mu) = \int p(x \mid \mu, \lambda) \log \frac{p(x \mid \mu, \lambda)}{p(x \mid \mu_0, \lambda)}\, dx = n \int N(x \mid \mu, \lambda) \log \frac{N(x \mid \mu, \lambda)}{N(x \mid \mu_0, \lambda)}\, dx = \frac{n\lambda}{2}(\mu - \mu_0)^2,$$

which is just a multiple (by $n/2$) of a natural, standardised, measure (the non-centrality
parameter) suggested by intuition as a discrepancy measure for a location-scale family.
Assuming the reference prior for $\mu$ derived from an $\{M_2\}$-closed perspective, which,
as a special case of Proposition 5.24, is easily seen to be uniform, we have the reference
posterior

$$p(\mu \mid x) = N(\mu \mid \bar{x}, n\lambda),$$

and hence

$$t(x) = \frac{n\lambda}{2} \int (\mu - \mu_0)^2\, N(\mu \mid \bar{x}, n\lambda)\, d\mu = \frac{1}{2}\left[1 + z^2(x)\right],$$

where, with $\sigma^2 = \lambda^{-1}$, the statistic $z(x) = \sqrt{n}(\bar{x} - \mu_0)/\sigma$ is seen to be a version of
the standard significance test statistic for a normal location null hypothesis. With respect
to $p(x \mid \theta_0)$, this has an $N(z \mid 0, 1)$ distribution and the appropriately calibrated value of
$c(x)$ is thus implicitly defined, for example, by rejecting $M_1$ if $|z(x)|$ exceeds the upper
$100(\alpha/2)\%$ point of an $N(\cdot \mid 0, 1)$ distribution.


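The above analysis is easily generalised to the case of unknown $\lambda$. Here, the reference
posterior for $(\mu, \lambda)$ from an $\{M_2\}$-closed perspective (see Example 5.17) is given by

$$p(\mu, \lambda \mid x) = N(\mu \mid \bar{x}, n\lambda)\, \text{Ga}(\lambda \mid \tfrac{1}{2}(n-1), \tfrac{1}{2}ns^2),$$

where $ns^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$. It follows that

$$\begin{aligned} t(x) &= \frac{1}{2} \int_0^{\infty}\!\! \int_{-\infty}^{\infty} n\lambda(\mu - \mu_0)^2\, N(\mu \mid \bar{x}, n\lambda)\, \text{Ga}(\lambda \mid \tfrac{1}{2}(n-1), \tfrac{1}{2}ns^2)\, d\mu\, d\lambda \\ &= \frac{1}{2} \int_0^{\infty} \left[1 + n\lambda(\bar{x} - \mu_0)^2\right] \text{Ga}(\lambda \mid \tfrac{1}{2}(n-1), \tfrac{1}{2}ns^2)\, d\lambda \\ &= \frac{1}{2}\left[1 + \frac{n(\bar{x} - \mu_0)^2}{s'^2}\right], \end{aligned}$$

where, with $s' = [ns^2/(n-1)]^{1/2}$, we see that $\sqrt{n}(\bar{x} - \mu_0)/s'$ is a version of the standard
significance test statistic for a normal location null hypothesis in the presence of an unknown
scale parameter. With respect to $p(x \mid \theta_0)$, this has a $\text{St}(t \mid 0, 1, n - 1)$ distribution and the
appropriately calibrated value of $c(x)$ is defined by the standard rejection procedure.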
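
The two closed forms obtained in Example 6.11 are simple to evaluate. The following Python sketch (the function names and the toy data are assumptions for illustration) returns the expected discrepancy $t(x)$ together with the corresponding standardised statistic, first for known precision and then for unknown precision.

```python
import math


def t_known_precision(x, mu0, lam):
    """Return (t(x), z(x)) when the precision lam is known."""
    n = len(x)
    xbar = sum(x) / n
    z = math.sqrt(n * lam) * (xbar - mu0)     # z = sqrt(n) (xbar - mu0) / sigma
    return 0.5 * (1.0 + z * z), z


def t_unknown_precision(x, mu0):
    """Return (t(x), Student-type statistic) when the precision is unknown."""
    n = len(x)
    xbar = sum(x) / n
    # s'^2 = n s^2 / (n - 1), the usual unbiased variance estimate.
    s_prime_sq = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    t = math.sqrt(n) * (xbar - mu0) / math.sqrt(s_prime_sq)
    return 0.5 * (1.0 + t * t), t


data = [1.0, 2.0, 3.0]                         # toy data
t_known, z_value = t_known_precision(data, mu0=0.0, lam=1.0)
t_unknown, t_value = t_unknown_precision(data, mu0=0.0)
```

In both cases the expected discrepancy is a monotone function of the absolute value of the familiar test statistic, which is why the calibrated rejection rule reduces to the standard normal or Student procedure.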


The reader can easily extend the above analyses to other stylised test situations: 
for example, testing the equality of means in two independent normal samples, with 
known or unknown (equal) precisions. Rueda (1992) provides the general expressions
for one-dimensional regular exponential family models. Multivariate normal
location cases are also easily dealt with, the logarithmic divergence discrepancy in
this case being proportional to the Mahalanobis distance (see Ferrandiz, 1985). We
shall not pursue such cases further here, since it seems to us that detailed discussion
of model rejection and comparison procedures all too easily becomes artificial
outside the disciplined context of real applications of the kind we shall introduce in
the second and third volumes of this work. From the perspective of this volume, we 
have taken the analyses of this chapter sufficiently far to demonstrate the creative 
possibilities for model choice and comparison within the disciplined framework of 
Bayesian decision theory. 


6.3 DISCUSSION AND FURTHER REFERENCES 
6.3.1 Overview 


We have argued that both from the perspective of a sensitive individual modeller, 
and also from that of a group of modellers, there are frequently strong reasons for 
considering a range of possible models. 


This obviously leads to the problem of model comparison, or model choice, 
and our approach has been to consider formally a decision problem where the 
action space is the class of available models. In this setting, we have shown that 
“natural” Bayesian solutions, such as choosing the model with the highest posterior 
probability, are obtained as particular cases of the general structure for stylised, 
appropriately chosen, loss functions. 


We have also considered the generally ill-posed problem of model rejection, 
where the primary decision consists in acting as if the proposed model were true 
— without having specific alternatives in mind —and have shown that useful results 
may be obtained by embedding the proposed model within a larger class, and then 
using discrepancy measures as loss functions in order to decide whether or not the 
original simpler model may be retained after all. 


There is an extensive Bayesian literature directly related to the issues discussed 
in this chapter. Some authors adopt a purely inferential approach, by deriving either 
posterior probabilities, or Bayes factors for competing models; see, for example, 
Lindley (1965, 1972), Dickey and Lientz (1970), Dickey (1971, 1977), Leamer 
(1978), Bernardo (1980), Smith and Spiegelhalter (1980), Zellner and Siow (1980), 
Spiegelhalter and Smith (1982), Zellner (1984), Berger and Delampady (1987), Pettit
and Young (1990), Aitkin (1991), Gómez-Villegas and Gómez (1992), Kass and
Vaidyanathan (1992), McCulloch and Rossi (1992) and Lindley (1993). Others 
openly adopt a decision-theoretic approach; see, for example, Karlin and Rubin 
(1956), Raiffa and Schlaifer (1961), Schlaifer (1961), Box and Hill (1967), DeG- 
root (1970), Zellner (1971), Bernardo (1982, 1985a), San Martini and Spezzaferri 
(1984), Berger (1985a), Bernardo and Bayarri (1985), Ferrandiz (1985), Poskitt 
(1987), Felsenstein (1992) and Rueda (1992).


6.3.2 Modelling and remodelling


We have already argued that we see Bayesian statistics as a rather formalised 
procedure for inference and decision making within a well-defined probabilistic 
structure. 

Fully specified belief models are an integral part of this structure, but it would 
be highly unrealistic to expect that in any particular application such a belief model 
will be general enough to pragmatically encompass a defensible description of 
reality from the very beginning. 

In practice, we typically first consider simple models, which may have been 
informally suggested by a combination of exploratory data analysis, graphical anal- 
ysis and prior experience with similar situations. And even with such a simple 
model, more formal investigation of its adequacy and the consequences of using 
it will often be necessary before one is prepared to seriously consider the model 
as a predictive specification. Such investigations will typically include residual 
analysis, identification of outliers and/or influential data, cluster analysis, and the 
behaviour of diagnostic statistics when compared with their predictive distribu- 
tions. We shall not elaborate on this here. Some relevant references are Johnson 
and Geisser (1982, 1983, 1985), Pettit and Smith (1985), Pettit (1986), Geisser 
(1987, 1992, 1993), Chaloner and Brant (1988), McCulloch (1989), Verdinelli 
and Wasserman (1991), Gelfand, Dey and Chang (1992), Weiss and Cook (1992),
Guttman and Peña (1993), Peña and Guttman (1993), and Chaloner (1994).

Bayarri and DeGroot (1987, 1990, 1992a) provide a Bayesian analysis of 
selection models, where data are randomly selected from a proper subset of the 
sample space rather than from the entire population. 

As a consequence of this probing, mainly exploratory analysis, a class of 
alternative models will typically emerge. In this chapter we have discussed some 
of the procedures which may be useful in a formal comparison of such alternative
models. The outcome of this strategy will typically be a more refined model for 
which a similar type of analysis may be repeated again. 

Naturally, a pragmatic combination of time constraints, data limitations, and 
capacity of imagination, will force this sequence of informal exploration and formal 
analysis to eventually settle on the use of a particular belief model, which hopefully 
can be defended as a sensible and useful conceptual representation of the problem. 

This remodelling process is never fully completed, however, in that either new
data, more time, or an imaginative new idea, may force one to make yet another 
iteration towards the never attainable “perfect”, all powerful predictive machine. 


6.3.3 Critical issues 


We shall comment further on six aspects of the general topic of remodelling under
the following subheadings: (i) Model Choice and Model Criticism, (ii) Inference
and Decision, (iii) Overfitting and Cross-validation, (iv) Improper Priors, (v) Scientific
Reporting, and (vi) Computer Software.


Model Choice and Model Criticism 


We have reviewed several procedures which, under different headings such as model 
comparison, model choice or model selection, may be used to choose among a class 
of alternative models, and we have argued that, from a decision-theoretical point 
of view, the problem of accepting that a particular model is suitable, is ill-defined 
unless an alternative is formally considered. See, also, Hodges (1990, 1992). 

However, partly due to the classical heritage of significance testing, and given 
the obvious attraction of being able to check the adequacy of a given model without 
explicit consideration of any alternative, non-decision-theoretic Bayesians have 
often tried to produce procedures which evaluate the compatibility of the data with 
specific models. 

As clearly stated by Box (1980), the posterior distribution of the parameters 
only permits the 


estimation of the parameters conditional on the adequacy of the entertained model, 
while the predictive distribution makes possible criticism of the entertained model
in the light of current data. 


Moreover, the predictive distributions which correspond to different models 
are comparable among themselves, while —in general—the posteriors are not. The 
use of predictive distributions to check model assumptions was pioneered by Jef- 
freys (1939/1961). Additional references include Geisser (1966, 1971, 1985, 1987, 
1988, 1993), Box and Tiao (1973), Dempster (1975), Geisser and Eddy (1979), Rubin
(1984), Bernardo and Bermúdez (1985), Clayton et al. (1986), Gelfand, Dey
and Chang (1992) and Giron et al. (1992).

The basic idea consists of defining a set of appropriate diagnostic functions 
$t_i = t_i(x_{n+1}, \ldots, x_{n+k})$ of the data and comparing their actual values in a sample
with their predictive distributions $p_{t_i}(\cdot \mid x_1, \ldots, x_n)$ based on a different sample
from the same population. Possible comparisons include checking whether or not
the observed $t_i$'s belong to appropriate predictive HPD intervals, or determining the
predictive probability of observing $t_i$'s more “outlying” than those observed. The
reader will readily appreciate that common techniques such as residual analysis, 
identification of influential observations, segregation of homogeneous clusters, or 
outlier screening, can all be reformulated as particular implementations of this 
general framework. 
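
As a rough illustration of the second kind of comparison, the sketch below computes the predictive probability of a value at least as "outlying" as the one observed. It assumes, purely for illustration, a normal model with known precision and the reference predictive $N(x \mid \bar{x}, \lambda n (n+1)^{-1})$ listed in Appendix A; the function name and the choice of a single future observation as the diagnostic are hypothetical.

```python
import math

def predictive_tail_probability(train, new_value, precision=1.0):
    """Two-sided predictive probability of a value at least as far from the
    predictive mean as new_value (normal model, known precision, flat prior)."""
    n = len(train)
    mean = sum(train) / n
    # Reference predictive N(x | mean, precision * n / (n + 1)),
    # i.e. predictive variance (n + 1) / (n * precision).
    sd = math.sqrt((n + 1) / (n * precision))
    z = abs(new_value - mean) / sd
    # 2 * (1 - Phi(z)) written with the error function.
    return 1.0 - math.erf(z / math.sqrt(2.0))

sample = [0.2, -0.1, 0.4, 0.0, -0.5]
print(predictive_tail_probability(sample, 0.3))   # unremarkable value
print(predictive_tail_probability(sample, 5.0))   # strongly outlying value
```

A small probability flags the new value as poorly compatible with the predictive distribution; what counts as "small" is left to the analyst.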

As mentioned before, we see these very useful activities as part of the infor- 
mal process that necessarily precedes the formulation of a model which we can 
then seriously entertain as an adequate predictive tool. However, it seems to us 
inescapable that if a formal decision is to be reached on whether or not to operate
with a given model, then some form of alternative must be considered. 

For further discussion of model choice, see Winkler (1980), Klein and Brown 
(1984), Krasker (1984), Florens and Mouchart (1985), Poirier (1985), Hill (1986, 
1990), Skene et al. (1986) and West (1986). 


Inference and Decision 


Throughout this volume, we have emphasised the advantages of using a formal 
decision-oriented approach to the stylised statistical problems which represent a 
large proportion of the theoretical statistical literature. These advantages are spe- 
cially obvious in model comparison since, by requiring the specification of an 
appropriate utility function, they make explicit the identification of those aspects 
of the model which really matter. 

We have seen, moreover, that the more traditional Bayesian approaches to 
model comparison, such as determining the posterior probabilities of competing
models or computing the relevant Bayes factors, can be obtained as particular cases 
of the general structure by using appropriately chosen, stylised utility functions. 

Very often, the consequences of entertaining a particular model may usefully 
be examined in terms of the discrepancies between the prediction provided by the 
model for the value of a relevant observable vector, $t$, say, and its actual, eventually
observed, value. Scoring rules, of the general type $u(p_t(\cdot \mid x_1, \ldots, x_n), t)$,
provide natural utility functions to use in this context, by explicitly evaluating the
degree of compatibility between the observed $t$ and its predictive distribution.
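
One familiar instance of such a scoring rule is the logarithmic score, $u(p, t) = \log p(t)$. The sketch below is only an illustration under that assumption: the two normal predictive densities and the helper names are hypothetical, not taken from the text.

```python
import math

def normal_density(mean, sd):
    """Return a normal predictive density as a function of t."""
    def density(t):
        return math.exp(-0.5 * ((t - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return density

def log_score(predictive_density, observed):
    """Logarithmic scoring rule: utility of a predictive density at the observed t."""
    return math.log(predictive_density(observed))

sharp = normal_density(0.0, 0.5)    # confident prediction
diffuse = normal_density(0.0, 3.0)  # cautious prediction

# The sharp predictive is rewarded when t falls near its centre
# and penalised heavily when t falls far away.
print(log_score(sharp, 0.1), log_score(diffuse, 0.1))
print(log_score(sharp, 4.0), log_score(diffuse, 4.0))
```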


Overfitting and Cross-validation


If we are hoping for a positive evaluation of a prediction it is crucial that the pre- 
dictive distribution is based on data which do not include the value to be predicted; 
otherwise, severe overfitting may occur. Pragmatically, however, although it is 
sometimes possible to check the predictions of the model under investigation by 
using a totally different set of data than that used to develop the model, it is far
more common to be obliged to do both model construction and model checking 
with the same data set. 


A natural solution consists of randomly partitioning the available sample $z = \{x_1, \ldots, x_n\}$,
say, into two subsamples $\{z_1, z_2\}$, one of which is used to produce the
relevant predictive distributions, and the other to compute the diagnostic functions; 
the procedure then being repeated as many times as is necessary to reach stable 
conclusions. This technique is usually known as cross-validation. For recent work, 
see Peña and Tiao (1992), Gelfand, Dey and Chang (1992), and references therein.
For a discussion of how cross validation may be seen as approximating formal 
Bayes procedures, see Section 6.1.6. 

A possible systematic approach to cross-validation starting from a sample of
size $n$, $z = \{x_1, \ldots, x_n\}$, and a model $\{p(x \mid \theta), p(\theta)\}$, involves the following
steps:

(i) Define a sample size $k$, where $k < n/2$ is large enough to evaluate the relevant
observable function $t = t(x_1, \ldots, x_k)$. The observable function could either
be that predictive aspect of the model which is of interest, as described by the 
utility function, or a diagnostic function, as described in the above approach 
to model criticism. 

(ii) Determine the set of all predictive distributions of the form

$$p_j(t(z_j) \mid z_{[j]}), \quad j = 1, \ldots, \binom{n}{k},$$

where $z_j$ is a subsample of $z$ of size $k$ and $z_{[j]}$ consists of all the $x_i$'s in $z$
which are not in $z_j$.

(iii) Estimate the expected utility of the model by

$$\binom{n}{k}^{-1} \sum_{j} u\{p_j(\cdot \mid z_{[j]}), t(z_j)\}.$$


Note that the last expression is simply a Monte Carlo approximation to the exact 
value of the expected utility. We also note that this programme may be carried out 
with reference distributions since the corresponding reference (posterior) predictive 
distributions $\pi_j(t(z_j) \mid z_{[j]})$ will be proper even if the reference prior $\pi(\theta)$ is not.
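
The three steps can be sketched for a small example. The code below assumes, for illustration only, a Bernoulli model with the reference prior $\mathrm{Be}(\theta \mid \frac{1}{2}, \frac{1}{2})$, the number of successes in the held-out subsample as the observable $t$, and a logarithmic utility; none of these choices is prescribed by the text, and the function names are hypothetical. The predictive distribution is the Binomial-Beta form listed in Appendix A.

```python
import math
from itertools import combinations

def log_beta_binomial(t, a, b, k):
    """log Bb(t | a, b, k), the Binomial-Beta probability of t successes in k trials."""
    return (math.lgamma(k + 1) - math.lgamma(t + 1) - math.lgamma(k - t + 1)
            + math.lgamma(a + t) + math.lgamma(b + k - t) - math.lgamma(a + b + k)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def cv_expected_log_utility(z, k, a=0.5, b=0.5):
    """Average log predictive score over all subsamples z_j of size k (steps i-iii)."""
    n = len(z)
    scores = []
    for held_out in combinations(range(n), k):      # step (ii): every z_j
        held = set(held_out)
        rest = [z[i] for i in range(n) if i not in held]   # z_[j]
        r = sum(rest)
        t = sum(z[i] for i in held_out)                    # t(z_j)
        # Posterior given z_[j] is Be(a + r, b + len(rest) - r).
        scores.append(log_beta_binomial(t, a + r, b + len(rest) - r, k))
    return sum(scores) / len(scores)                # step (iii)

z = [1, 1, 0, 1, 1, 1, 0, 1]
print(cv_expected_log_utility(z, k=2))
```

Enumerating all $\binom{n}{k}$ subsamples is only feasible for small $n$; for larger samples a random selection of subsamples would play the same role.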


Improper Priors 
In the context of analysis predicated on a fixed model, we have seen in Chapter 5 
that perfectly proper posterior parametric and predictive inferences can be obtained 
for improper prior specifications. 

When it comes to comparing models, however, in general the use of improper 
priors is much more problematic. We first note that for models 


$$M_i = \{p_i(x \mid \theta_i), p_i(\theta_i)\}, \quad i \in I,$$

the predictive quantities

$$p_i(x) = \int p_i(x \mid \theta_i)\, p_i(\theta_i)\, d\theta_i, \quad i \in I,$$

typically play a key role in model comparisons for a range of specific decision
problems and perspectives on model choice. But if one or more of the $p_i(\theta_i)$ is not
a proper density, the corresponding $p_i(x)$'s will also be improper, thus precluding,
for example, the calculation of posterior probabilities for models in an M-closed
analysis.

Essentially, with a formal improper specification for the prior component of 
a model an initial amount of data needs to be passed through the prior to posterior 
process before the model attains proper status as a predictive tool and can hence be 
compared with other alternative predictive tools. 

An exception to this arises when two models $M_i$, $M_j$, say, have common $\theta$
and improper $p(\theta)$, in which case it can be argued that the ratio $p_i(x)/p_j(x)$, the
Bayes factor in favour of $M_i$, does provide a meaningful comparison between the
two models. However, there is an inherent difficulty with these methods when the
models compared have different dimensions.

Indeed, with the reference prior approach some models are implicitly disadvantaged
relative to others; technically, this is due to the fact that the amount of
information about the parameters of interest to be expected from the data crucially 
depends on the prior distribution of the nuisance parameters present in the model. 
Lindley’s paradox (see, also, Bartlett, 1957), discussed earlier in this chapter, is a 
well known simple example of this behaviour. 

A possible solution consists in specifying the improper prior probabilities of 
the models — or, equivalently, weighting the Bayes factors —in a way which may be 
expected to achieve neutral discrimination between the models. Some suggestions 
along these and similar lines include Bernardo (1980), Spiegelhalter and Smith 
(1982), Pericchi (1984), Eaves (1985) and Consonni and Veronese (1992a). 

Another possible solution to the problem of comparing two models in the 
case of improper prior specification for the parameters is to exploit the use of 
cross-validation as a Monte Carlo approximation to a Bayes decision rule. 

As we saw in Section 6.1.6, for the problem of predicting a future observation 
using a log-predictive utility function the (Monte Carlo approximated) criterion for 
model choice involves the geometric mean of Bayes factors of the form 


$$B_{12}(x_j, x_{n-1}(j)) = \frac{p_1(x_j \mid x_{n-1}(j))}{p_2(x_j \mid x_{n-1}(j))},$$

where $[x_{n-1}(j), x_j]$ denotes a partition of $x = (x_1, \ldots, x_n)$, and

$$p_i(x_j \mid x_{n-1}(j)) = \int p_i(x_j \mid \theta_i)\, p_i(\theta_i \mid x_{n-1}(j))\, d\theta_i.$$

Since, for sufficiently large $n$, $p_i(\theta_i \mid x_{n-1}(j))$ will be proper, even for improper
(non-pathological) priors $p_i(\theta_i)$, no problem arises.

However, recalling from our discussion in Section 6.1.6 that the conventional
Bayes factor is used to assess the models' ability to “predict $x$” from
$\{p_i(x \mid \theta_i), p_i(\theta_i)\}$, we see that the latter does run into trouble if $p_i(\theta_i)$ is
improper, since, then, $p_i(x)$ is not proper.




Proceeding formally, if we were to take $\log p_i(x)$ as the utility of choosing
$M_i$, the M-open perspective prefers $M_1$ if

$$\int \log \frac{p_1(x)}{p_2(x)}\, p(x)\, dx > 0,$$

where $p(x)$ is not specified. Again, we can approach the evaluation of the left-hand
side as a Monte Carlo calculation, based on partitioning $x$ and averaging over
random partitions. However, in this context we want partitions where the proxy
for the predictive part resembles data $x$ and the proxy for the conditioning part
resembles “no data”. The closest we can come to this, and overcome the problem
of the impropriety of $p_i(\theta_i)$, is to take partitions of the form $x = [x_s(j), x_{n-s}(j)]$,
where $s\,(\geq 1)$ is the smallest integer such that both $p_1(\theta_1 \mid x_s(j))$ and $p_2(\theta_2 \mid x_s(j))$
are proper, and $j = 1, \ldots, \binom{n}{s}$.

The proposal is now to select randomly $k$ such partitions, and approximate the
left-hand side of the criterion inequality by

$$\frac{1}{k} \sum_{j=1}^{k} \log \left[ \frac{p_1(x_{n-s}(j) \mid x_s(j))}{p_2(x_{n-s}(j) \mid x_s(j))} \right].$$

The (Monte Carlo approximated) model choice criterion then becomes, prefer
$M_1$ if

$$\prod_{j=1}^{k} \left[ \frac{p_1(x_{n-s}(j) \mid x_s(j))}{p_2(x_{n-s}(j) \mid x_s(j))} \right]^{1/k} = \prod_{j=1}^{k} \left[ B_{12}(x_{n-s}(j), x_s(j)) \right]^{1/k} > 1,$$

where $B_{12}(x_{n-s}(j), x_s(j))$ denotes the Bayes factor for $M_1$ against $M_2$, based on
the versions

$$\{p_i(x_{n-s}(j) \mid \theta_i), p_i(\theta_i \mid x_s(j))\}$$

of $M_1$, $M_2$. Again, we see the explicit role of the geometric average of Bayes
factors, but with the latter “reversing”, in a sense, the role of past and future data 
compared with the form obtained in Section 6.1.6. 
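
A toy version of this criterion can be sketched as follows. The two models are assumptions made for illustration only: $M_1$ takes $x_i \sim N(0, 1)$ with no free parameter, and $M_2$ takes $x_i \sim N(\mu, 1)$ with an improper flat prior on $\mu$, so that $s = 1$ observation suffices to make the posterior proper. The function names and the seed are hypothetical.

```python
import math
import random

def log_norm_pdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_pred_m1(rest):
    """log p_1(x_{n-s}(j) | x_s(j)): M1 has no parameters, so training data are ignored."""
    return sum(log_norm_pdf(x, 0.0, 1.0) for x in rest)

def log_pred_m2(rest, train):
    """log p_2(x_{n-s}(j) | x_s(j)) by the chain rule of one-step predictives.
    After m observations the posterior of mu is N(mean, 1/m) and the
    predictive for the next observation is N(mean, 1 + 1/m)."""
    seen = list(train)
    total = 0.0
    for x in rest:
        m = len(seen)
        total += log_norm_pdf(x, sum(seen) / m, 1.0 + 1.0 / m)
        seen.append(x)
    return total

def log_geometric_mean_bf(data, k, seed=0):
    """Log of the geometric mean of B_12 over k randomly chosen partitions with s = 1."""
    rng = random.Random(seed)
    picks = rng.sample(range(len(data)), k)   # index of the single training point
    logs = []
    for j in picks:
        rest = data[:j] + data[j + 1:]
        logs.append(log_pred_m1(rest) - log_pred_m2(rest, [data[j]]))
    return sum(logs) / k

data = [2.1, 1.7, 2.9, 2.4, 1.8, 2.6]
print(log_geometric_mean_bf(data, k=4))   # negative: the criterion prefers M2
```

A positive value corresponds to a geometric mean above 1 and hence a preference for $M_1$; a negative value favours $M_2$.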


At the time of writing we are aware of work in progress by several researchers who 
propose forms related to those discussed here. These include: J. O. Berger and 
L. R. Pericchi (intrinsic Bayes factors), A. O'Hagan (fractional Bayes factors)
and A. F. de Vos (fair Bayes factors). 

From a practical perspective, it might be desirable to trade off, in utility terms,
“fidelity to the observed data” and “future predictive power”. This can be formalised
by adopting a utility function of the form

$$\alpha \log[p_i(x)] + (1 - \alpha) \log[p_i(y \mid x)], \quad 0 < \alpha < 1,$$

which, in turn, leads to a criterion of the form, prefer $M_1$ if

$$\prod_{j=1}^{k} \left[ B_{12}(x_{n-s}(j), x_s(j)) \right]^{\alpha/k} \; \prod_{j=1}^{k} \left[ B_{12}(x_j, x_{n-1}(j)) \right]^{(1-\alpha)/k} > 1.$$

Work in progress (by L. R. Pericchi and A. F. M. Smith) suggests that such a 
criterion effectively encompasses and extends a number of current criteria. 

We conclude by emphasising again that a predictivistic decision-theoretical 
approach to model comparison, where models are evaluated in terms of their pre- 
dictive behaviour, bypasses the dimensionality issue, since posterior predictive 
distributions obtained from models with different dimensions are always directly 
comparable. 


Scientific Reporting 

Our whole development has been predicated on the central idea of normative stan- 
dards for an individual wishing to act coherently in response to uncertainty. Beliefs 
as individual, personal probabilities are the key element in this process and, at any 
given moment in the learning cycle, are the encapsulation of the current response
to the uncertainties of interest. 

However, while many are willing to concede that, in narrowly focused de- 
cision-making, such beliefs are an essential element, there has also been a wide- 
spread view (see, for example, Fisher, 1956/1973) that it would be somehow sub- 
versive to sully the nobler, objective processes of science by allowing subjective 
beliefs to enter the picture. As Dickey (1973) has remarked, rhetorically: 


But is not personal knowledge, or opinion, like superstition, non-objective and 
unscientific, and therefore to be avoided in science? Who cares to read about a 
scientific reporter's opinion as described by his prior and posterior probabilities? 


We have already made clear our own general view that objectivity has no 
meaning in this context apart from that pragmatically endowed by thinking of it as 
a shorthand for subjective consensus. However, there are clearly practical problems
of communication between analysts and audiences which need addressing. 

The solution to such problems lies in combining ideas from Chapters 4 and 6. 
On the one hand, we have seen that shared assumptions about structural aspects 
of beliefs (for example, exchangeability) can lead a group of individuals to have
shared assumptions about the parametric model component, while perhaps differing 
over the prior component specification. On the other hand, we have seen, from 
several perspectives, that entertaining and comparing a range of models fit perfectly 
naturally within the formalism. There is nothing within the world of Bayesian 
Statistics that prohibits a scientist from performing and reporting a range of “what 
if?” analyses. To quote Dickey again: 


... Communicating a single opinion ought not to be the purpose of a scientific 
report; but, rather, to let the data speak for themselves by giving the effect of 
the data on the wide diversity of real prior opinions. ... an experimenter can 
summarise the application of Bayes’ theorem to whole ranges of prior distribu- 
tions, derived to include the opinions of his readers. Scientific reports should 
objectively exhibit as much as possible of the inferential content of the data, 
the data-specific prior-to-posterior transformation of the collection of all per- 
sonal probability distributions on the parameters of a realistically rich statistical 
model. 


We believe that this is the way forward—although, it has to be said, there 
is a great deal of work to be done in effecting such a cultural change. There are 
also some obvious technical challenges in making such a programme routinely 
implementable in practice. However, computational power grows apace, as does 
the sophistication of graphical displays. We shall return to this general problem in 
the second and third volumes of this work. 

For early thoughts on these issues, see Edwards et al. (1963) and Hildreth
(1963); for technical expositions, see Dickey (1973), Roberts (1974) and Dickey 
and Freeman (1975); for a discussion in the context of a public policy debate, see 
Smith (1978). 


Computer Software 


Despite the emphasis in Chapters 2 and 3 of this volume on foundational issues —
necessary for a complete treatment of Bayesian theory—we are well aware that 
the majority of practising statisticians are more likely to be influenced by positive, 
preferably hands-on, experience with applications of methods to concrete problems 
than they ever will be by philosophical victories attained through the (empirically) 
bloodless means of axiomatics and stylised counter-examples. We are also well 
aware that the availability of suitable software is the key to the possibility of ob- 
taining that hands-on experience. 

But what are the appropriate software tools for Bayesian Statistics? What 
software? For whom? For what kinds of problems and purposes? 

A number of such issues were reviewed in Smith (1988), but at the time of 
writing, many still remain unresolved. Goel (1988) provided a review of Bayesian 
software in the late 80's. Examples of creative use of modern software in Bayesian
analysis include: Smith et al. (1985, 1987) and Racine-Poon et al. (1986), who
describe the use of the Bayes Four package; Grieve (1987); Lauritzen and Spiegelhalter
(1988), on which the commercial expert system builder Ergo™ is based;
Albert (1990), Korsan (1992), and Ley and Steel (1992), who make use of the commercial
package Mathematica™; Tierney (1990, 1992), who presents LISP-STAT,
an object oriented environment for statistical computing and discusses possible
uses of graphical animation; Cowell (1992) and Spiegelhalter and Cowell (1992),
who, respectively, describe and apply the probabilistic expert system shell BAIES;
Racine-Poon (1992), who discusses sample-assisted graphical analysis; Thomas
et al. (1992), who describe BUGS, a program to perform Bayesian inference
using Gibbs sampling; Wooff (1992), who describes [B/D], an implementation
of subjectivist analysis of beliefs, as described by Goldstein (1981, 1988, 1991,
and references therein); and Marriott and Naylor (1993), who discuss the use of
MINITAB to teach Bayesian statistics. Further review and detailed illustration will
be provided in the volumes Bayesian Computation and Bayesian Methods.


Appendix A


Summary of
Basic Formulae


Summary 


Two sets of tables are provided for reference. The first records the definition, and 
the first two moments of the most common probability distributions used in this 
volume. The second records the basic elements of standard Bayesian inference 
processes for a number of special cases. In particular, it records the appropriate 
likelihood function, the sufficient statistics, the conjugate prior and correspond- 
ing posterior and predictive distributions, the reference prior and corresponding 
reference posterior and predictive distributions. 


A.1. PROBABILITY DISTRIBUTIONS 


The first section of this Appendix consists of a set of tables which record the 
notation, parameter range, variable range, definition, and first two moments of the 
probability distributions (discrete and continuous, univariate and multivariate) used 
in this volume. 
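

Univariate Discrete Distributions


$\mathrm{Br}(x \mid \theta)$   Bernoulli (p. 115)

$0 < \theta < 1$;   $x = 0, 1$
$p(x) = \theta^{x}(1 - \theta)^{1-x}$
$E[x] = \theta$,   $V[x] = \theta(1 - \theta)$


$\mathrm{Bi}(x \mid \theta, n)$   Binomial (p. 115)

$0 < \theta < 1$, $n = 1, 2, \ldots$;   $x = 0, 1, \ldots, n$
$p(x) = \binom{n}{x} \theta^{x}(1 - \theta)^{n-x}$
$E[x] = n\theta$,   $V[x] = n\theta(1 - \theta)$


$\mathrm{Bb}(x \mid \alpha, \beta, n)$   Binomial-Beta (p. 117)

$\alpha > 0$, $\beta > 0$, $n = 1, 2, \ldots$;   $x = 0, 1, \ldots, n$
$p(x) = c \binom{n}{x} \Gamma(\alpha + x)\, \Gamma(\beta + n - x)$,   $c = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)\, \Gamma(\alpha + \beta + n)}$
$E[x] = \dfrac{n\alpha}{\alpha + \beta}$,   $V[x] = \dfrac{n\alpha\beta}{(\alpha + \beta)^2}\, \dfrac{(\alpha + \beta + n)}{(\alpha + \beta + 1)}$


$\mathrm{Hy}(x \mid N, M, n)$   Hypergeometric (p. 115)

$N = 1, 2, \ldots$;   $M = 1, 2, \ldots$;   $n = 1, \ldots, N + M$
$x = a, a + 1, \ldots, b$, with $a = \max(0, n - M)$, $b = \min(n, N)$
$p(x) = c \binom{N}{x} \binom{M}{n - x}$,   $c = \binom{N + M}{n}^{-1}$
$E[x] = \dfrac{nN}{N + M}$,   $V[x] = \dfrac{nNM}{(N + M)^2}\, \dfrac{N + M - n}{N + M - 1}$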
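

Univariate Discrete Distributions (continued)


$\mathrm{Nb}(x \mid \theta, r)$   Negative-Binomial (p. 116)

$0 < \theta < 1$, $r = 1, 2, \ldots$;   $x = 0, 1, 2, \ldots$
$p(x) = c \binom{r + x - 1}{r - 1} (1 - \theta)^{x}$,   $c = \theta^{r}$
$E[x] = r\, \dfrac{1 - \theta}{\theta}$,   $V[x] = r\, \dfrac{1 - \theta}{\theta^2}$


$\mathrm{Nbb}(x \mid \alpha, \beta, r)$   Negative-Binomial-Beta (p. 118)

$\alpha > 0$, $\beta > 0$, $r = 1, 2, \ldots$;   $x = 0, 1, 2, \ldots$
$p(x) = c \binom{r + x - 1}{r - 1} \dfrac{\Gamma(\beta + x)}{\Gamma(\alpha + \beta + r + x)}$,   $c = \dfrac{\Gamma(\alpha + \beta)\, \Gamma(\alpha + r)}{\Gamma(\alpha)\, \Gamma(\beta)}$
$E[x] = \dfrac{r\beta}{\alpha - 1}$,   $V[x] = \dfrac{r\beta}{\alpha - 1} \left[ \dfrac{\alpha + \beta + r - 1}{\alpha - 2} + \dfrac{r\beta}{(\alpha - 1)(\alpha - 2)} \right]$


$\mathrm{Pn}(x \mid \lambda)$   Poisson (p. 116)

$\lambda > 0$;   $x = 0, 1, 2, \ldots$
$p(x) = c\, \dfrac{\lambda^{x}}{x!}$,   $c = e^{-\lambda}$
$E[x] = \lambda$,   $V[x] = \lambda$


$\mathrm{Pg}(x \mid \alpha, \beta, \nu)$   Poisson-Gamma (p. 119)

$\alpha > 0$, $\beta > 0$, $\nu > 0$;   $x = 0, 1, 2, \ldots$
$p(x) = c\, \dfrac{\Gamma(\alpha + x)}{x!}\, \dfrac{\nu^{x}}{(\beta + \nu)^{\alpha + x}}$,   $c = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}$
$E[x] = \dfrac{\nu\alpha}{\beta}$,   $V[x] = \dfrac{\nu\alpha}{\beta} \left[ 1 + \dfrac{\nu}{\beta} \right]$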
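

Univariate Continuous Distributions


$\mathrm{Be}(x \mid \alpha, \beta)$   Beta (p. 116)

$\alpha > 0$, $\beta > 0$;   $0 < x < 1$
$p(x) = c\, x^{\alpha - 1}(1 - x)^{\beta - 1}$,   $c = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)}$
$E[x] = \dfrac{\alpha}{\alpha + \beta}$,   $V[x] = \dfrac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$


$\mathrm{Un}(x \mid a, b)$   Uniform (p. 117)

$b > a$;   $a < x < b$
$p(x) = c$,   $c = (b - a)^{-1}$
$E[x] = \frac{1}{2}(a + b)$,   $V[x] = \frac{1}{12}(b - a)^2$


$\mathrm{Ga}(x \mid \alpha, \beta)$   Gamma (p. 118)

$\alpha > 0$, $\beta > 0$;   $x > 0$
$p(x) = c\, x^{\alpha - 1} e^{-\beta x}$,   $c = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}$
$E[x] = \alpha\beta^{-1}$,   $V[x] = \alpha\beta^{-2}$


$\mathrm{Ex}(x \mid \theta)$   Exponential (p. 118)

$\theta > 0$;   $x > 0$
$p(x) = c\, e^{-\theta x}$,   $c = \theta$
$E[x] = 1/\theta$,   $V[x] = 1/\theta^2$


$\mathrm{Gg}(x \mid \alpha, \beta, n)$   Gamma-Gamma (p. 120)

$\alpha > 0$, $\beta > 0$, $n > 0$;   $x > 0$
$p(x) = c\, \dfrac{x^{n - 1}}{(\beta + x)^{\alpha + n}}$,   $c = \dfrac{\beta^{\alpha}\, \Gamma(\alpha + n)}{\Gamma(\alpha)\, \Gamma(n)}$
$E[x] = \dfrac{n\beta}{\alpha - 1}$, $\alpha > 1$,   $V[x] = \dfrac{\beta^2 \left( n^2 + n(\alpha - 1) \right)}{(\alpha - 1)^2 (\alpha - 2)}$, $\alpha > 2$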
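

Univariate Continuous Distributions (continued)


$\chi^2(x \mid \nu) = \chi^2_{\nu}$   Chi-squared (p. 120)

$\nu > 0$;   $x > 0$
$p(x) = c\, x^{(\nu/2) - 1} e^{-x/2}$,   $c = \dfrac{(1/2)^{\nu/2}}{\Gamma(\nu/2)}$
$E[x] = \nu$,   $V[x] = 2\nu$


$\chi^2(x \mid \nu, \lambda)$   Non-central Chi-squared (p. 121)

$\nu > 0$, $\lambda > 0$;   $x > 0$
$p(x) = \sum_{i=0}^{\infty} \mathrm{Pn}\left( i \mid \tfrac{\lambda}{2} \right) \chi^2(x \mid \nu + 2i)$
$E[x] = \nu + \lambda$,   $V[x] = 2(\nu + 2\lambda)$


$\mathrm{Ig}(x \mid \alpha, \beta)$   Inverted-Gamma (p. 119)

$\alpha > 0$, $\beta > 0$;   $x > 0$
$p(x) = c\, x^{-(\alpha + 1)} e^{-\beta/x}$,   $c = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}$
$E[x] = \dfrac{\beta}{\alpha - 1}$, $\alpha > 1$,   $V[x] = \dfrac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)}$, $\alpha > 2$


$\chi^{-2}(x \mid \nu)$   Inverted-Chi-squared (p. 119)

$\nu > 0$;   $x > 0$
$p(x) = c\, x^{-(\nu/2 + 1)} e^{-1/(2x)}$,   $c = \dfrac{(1/2)^{\nu/2}}{\Gamma(\nu/2)}$
$E[x] = \dfrac{1}{\nu - 2}$, $\nu > 2$,   $V[x] = \dfrac{2}{(\nu - 2)^2 (\nu - 4)}$, $\nu > 4$


$\mathrm{Ga}^{-1/2}(x \mid \alpha, \beta)$   Square-root Inverted-Gamma (p. 119)

$\alpha > 0$, $\beta > 0$;   $x > 0$
$p(x) = c\, x^{-(2\alpha + 1)} e^{-\beta/x^2}$,   $c = \dfrac{2\beta^{\alpha}}{\Gamma(\alpha)}$
$E[x] = \dfrac{\sqrt{\beta}\, \Gamma(\alpha - 1/2)}{\Gamma(\alpha)}$, $\alpha > \frac{1}{2}$,   $V[x] = \dfrac{\beta}{\alpha - 1} - E[x]^2$, $\alpha > 1$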
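

Univariate Continuous Distributions (continued)


$\mathrm{Pa}(x \mid \alpha, \beta)$   Pareto (p. 120)

$\alpha > 0$, $\beta > 0$;   $\beta \leq x < +\infty$
$p(x) = c\, x^{-(\alpha + 1)}$,   $c = \alpha\beta^{\alpha}$
$E[x] = \dfrac{\beta\alpha}{\alpha - 1}$, $\alpha > 1$,   $V[x] = \dfrac{\beta^2 \alpha}{(\alpha - 1)^2 (\alpha - 2)}$, $\alpha > 2$


$\mathrm{Ip}(x \mid \alpha, \beta)$   Inverted-Pareto (p. 120)

$\alpha > 0$, $\beta > 0$;   $0 < x < \beta^{-1}$
$p(x) = c\, x^{\alpha - 1}$,   $c = \alpha\beta^{\alpha}$
$E[x] = \beta^{-1} \alpha (\alpha + 1)^{-1}$,   $V[x] = \beta^{-2} \alpha (\alpha + 1)^{-2} (\alpha + 2)^{-1}$


$\mathrm{N}(x \mid \mu, \lambda)$   Normal (p. 121)

$-\infty < \mu < +\infty$, $\lambda > 0$;   $-\infty < x < +\infty$
$p(x) = c \exp\left\{ -\tfrac{1}{2} \lambda (x - \mu)^2 \right\}$,   $c = \lambda^{1/2} (2\pi)^{-1/2}$
$E[x] = \mu$,   $V[x] = \lambda^{-1}$


$\mathrm{St}(x \mid \mu, \lambda, \alpha)$   Student t (p. 122)

$-\infty < \mu < +\infty$, $\lambda > 0$, $\alpha > 0$;   $-\infty < x < +\infty$
$p(x) = c \left[ 1 + \alpha^{-1} \lambda (x - \mu)^2 \right]^{-(\alpha + 1)/2}$,   $c = \dfrac{\Gamma\left( \frac{1}{2}(\alpha + 1) \right)}{\Gamma\left( \frac{1}{2}\alpha \right) \Gamma\left( \frac{1}{2} \right)} \left( \dfrac{\lambda}{\alpha} \right)^{1/2}$
$E[x] = \mu$,   $V[x] = \lambda^{-1} \alpha (\alpha - 2)^{-1}$


$\mathrm{F}(x \mid \alpha, \beta) = \mathrm{F}_{\alpha, \beta}$   Snedecor F (p. 123)

$\alpha > 0$, $\beta > 0$;   $x > 0$
$p(x) = c\, \dfrac{x^{\alpha/2 - 1}}{(\beta + \alpha x)^{(\alpha + \beta)/2}}$,   $c = \dfrac{\Gamma\left( \frac{1}{2}(\alpha + \beta) \right)}{\Gamma\left( \frac{1}{2}\alpha \right) \Gamma\left( \frac{1}{2}\beta \right)}\, \alpha^{\alpha/2} \beta^{\beta/2}$
$E[x] = \dfrac{\beta}{\beta - 2}$, $\beta > 2$,   $V[x] = \dfrac{2\beta^2 (\alpha + \beta - 2)}{\alpha (\beta - 2)^2 (\beta - 4)}$, $\beta > 4$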
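

Univariate Continuous Distributions (continued)


$\mathrm{Lo}(x \mid \alpha, \beta)$   Logistic (p. 122)

$-\infty < \alpha < +\infty$, $\beta > 0$;   $-\infty < x < +\infty$
$p(x) = \beta^{-1} \exp\left\{ -\beta^{-1}(x - \alpha) \right\} \left[ 1 + \exp\left\{ -\beta^{-1}(x - \alpha) \right\} \right]^{-2}$
$E[x] = \alpha$,   $V[x] = \beta^2 \pi^2 / 3$


Multivariate Discrete Distributions


$\mathrm{Mu}_k(x \mid \theta, n)$   Multinomial (p. 133)

$\theta = (\theta_1, \ldots, \theta_k)$;   $x = (x_1, \ldots, x_k)$
$0 < \theta_i < 1$, $\sum_{l=1}^{k} \theta_l \leq 1$, $n = 1, 2, \ldots$;   $x_i = 0, 1, 2, \ldots$, $\sum_{l=1}^{k} x_l \leq n$
$p(x) = \dfrac{n!}{\prod_{l=1}^{k+1} x_l!} \prod_{l=1}^{k+1} \theta_l^{x_l}$,   $\theta_{k+1} = 1 - \sum_{l=1}^{k} \theta_l$,   $x_{k+1} = n - \sum_{l=1}^{k} x_l$
$E[x_i] = n\theta_i$,   $V[x_i] = n\theta_i (1 - \theta_i)$,   $C[x_i, x_j] = -n\theta_i \theta_j$


$\mathrm{Md}_k(x \mid \alpha, n)$   Multinomial-Dirichlet (p. 135)

$\alpha = (\alpha_1, \ldots, \alpha_{k+1})$, $\alpha_i > 0$, $n = 1, 2, \ldots$;   $x = (x_1, \ldots, x_k)$, $x_i = 0, 1, 2, \ldots$, $\sum_{l=1}^{k} x_l \leq n$
$p(x) = c \prod_{l=1}^{k+1} \dfrac{\alpha_l^{[x_l]}}{x_l!}$,   $c = \dfrac{n!}{\left( \sum_{l=1}^{k+1} \alpha_l \right)^{[n]}}$
$\alpha^{[s]} = \prod_{l=1}^{s} (\alpha + l - 1)$,   $x_{k+1} = n - \sum_{l=1}^{k} x_l$
$E[x_i] = np_i$,   $V[x_i] = \dfrac{n + \sum_{l=1}^{k+1} \alpha_l}{1 + \sum_{l=1}^{k+1} \alpha_l}\, np_i (1 - p_i)$,   $C[x_i, x_j] = -\dfrac{n + \sum_{l=1}^{k+1} \alpha_l}{1 + \sum_{l=1}^{k+1} \alpha_l}\, np_i p_j$
$p_i = \dfrac{\alpha_i}{\sum_{l=1}^{k+1} \alpha_l}$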
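

Multivariate Continuous Distributions


$\mathrm{Di}_k(x \mid \alpha)$   Dirichlet (p. 134)

$\alpha = (\alpha_1, \ldots, \alpha_{k+1})$, $\alpha_i > 0$;   $x = (x_1, \ldots, x_k)$, $0 < x_i < 1$, $\sum_{l=1}^{k} x_l \leq 1$
$p(x) = c \left( 1 - \sum_{l=1}^{k} x_l \right)^{\alpha_{k+1} - 1} \prod_{l=1}^{k} x_l^{\alpha_l - 1}$,   $c = \dfrac{\Gamma\left( \sum_{l=1}^{k+1} \alpha_l \right)}{\prod_{l=1}^{k+1} \Gamma(\alpha_l)}$
$E[x_i] = \dfrac{\alpha_i}{\sum_{l=1}^{k+1} \alpha_l}$,   $V[x_i] = \dfrac{E[x_i](1 - E[x_i])}{1 + \sum_{l=1}^{k+1} \alpha_l}$,   $C[x_i, x_j] = \dfrac{-E[x_i] E[x_j]}{1 + \sum_{l=1}^{k+1} \alpha_l}$


$\mathrm{Ng}(x, y \mid \mu, \lambda, \alpha, \beta)$   Normal-Gamma (p. 136)

$\mu \in \Re$, $\lambda > 0$, $\alpha > 0$, $\beta > 0$;   $x \in \Re$, $y > 0$
$p(x, y) = \mathrm{N}(x \mid \mu, \lambda y)\, \mathrm{Ga}(y \mid \alpha, \beta)$
$E[x] = \mu$,   $E[y] = \alpha\beta^{-1}$,   $V[x] = \beta\lambda^{-1} (\alpha - 1)^{-1}$,   $V[y] = \alpha\beta^{-2}$
$p(x) = \mathrm{St}(x \mid \mu, \alpha\beta^{-1}\lambda, 2\alpha)$


$\mathrm{N}_k(x \mid \mu, \lambda)$   Multivariate Normal (p. 136)

$\mu = (\mu_1, \ldots, \mu_k) \in \Re^k$;   $x = (x_1, \ldots, x_k) \in \Re^k$
$\lambda$ symmetric positive-definite
$p(x) = c \exp\left\{ -\tfrac{1}{2} (x - \mu)' \lambda (x - \mu) \right\}$,   $c = |\lambda|^{1/2} (2\pi)^{-k/2}$
$E[x] = \mu$,   $V[x] = \lambda^{-1}$


$\mathrm{Pa}_2(x, y \mid \alpha, \beta_0, \beta_1)$   Bilateral Pareto (p. 141)

$(\beta_0, \beta_1) \in \Re^2$, $\beta_0 < \beta_1$, $\alpha > 0$;   $(x, y) \in \Re^2$, $x < \beta_0$, $y > \beta_1$
$p(x, y) = c\, (y - x)^{-(\alpha + 2)}$,   $c = \alpha (\alpha + 1)(\beta_1 - \beta_0)^{\alpha}$
$E[x] = \dfrac{\alpha\beta_0 - \beta_1}{\alpha - 1}$,   $E[y] = \dfrac{\alpha\beta_1 - \beta_0}{\alpha - 1}$,   $V[x] = V[y] = \dfrac{\alpha (\beta_1 - \beta_0)^2}{(\alpha - 1)^2 (\alpha - 2)}$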
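

Multivariate Continuous Distributions (continued)


$\mathrm{Ng}_k(x, y \mid \mu, \lambda, \alpha, \beta)$   Multivariate Normal-Gamma (p. 140)

$-\infty < \mu_i < +\infty$, $\alpha > 0$, $\beta > 0$;   $(x, y) = (x_1, \ldots, x_k, y)$
$\lambda$ symmetric positive-definite;   $-\infty < x_i < +\infty$, $y > 0$
$p(x, y) = \mathrm{N}_k(x \mid \mu, \lambda y)\, \mathrm{Ga}(y \mid \alpha, \beta)$
$E[x, y] = (\mu, \alpha\beta^{-1})$,   $V[x] = (\alpha - 1)^{-1} \beta\lambda^{-1}$,   $V[y] = \alpha\beta^{-2}$
$p(x) = \mathrm{St}_k(x \mid \mu, \lambda\alpha\beta^{-1}, 2\alpha)$,   $p(y) = \mathrm{Ga}(y \mid \alpha, \beta)$


$\mathrm{Nw}_k(x, y \mid \mu, \lambda, \alpha, \beta)$   Multivariate Normal-Wishart (p. 140)

$-\infty < \mu_i < +\infty$, $\lambda > 0$, $2\alpha > k - 1$;   $x = (x_1, \ldots, x_k)$, $-\infty < x_i < +\infty$
$\beta$ symmetric non-singular;   $y$ symmetric positive-definite
$p(x, y) = \mathrm{N}_k(x \mid \mu, \lambda y)\, \mathrm{Wi}_k(y \mid \alpha, \beta)$
$E[x, y] = \{\mu, \alpha\beta^{-1}\}$,   $V[x] = (\alpha - 1)^{-1} \beta\lambda^{-1}$
$p(x) = \mathrm{St}_k(x \mid \mu, \lambda\alpha\beta^{-1}, 2\alpha)$,   $p(y) = \mathrm{Wi}_k(y \mid \alpha, \beta)$


$\mathrm{St}_k(x \mid \mu, \lambda, \alpha)$   Multivariate Student (p. 139)

$-\infty < \mu_i < +\infty$, $\alpha > 0$;   $x = (x_1, \ldots, x_k)$, $-\infty < x_i < +\infty$
$\lambda$ symmetric positive-definite
$p(x) = c \left[ 1 + \dfrac{1}{\alpha} (x - \mu)' \lambda (x - \mu) \right]^{-(\alpha + k)/2}$,   $c = \dfrac{\Gamma\left( \frac{1}{2}(\alpha + k) \right)}{\Gamma\left( \frac{1}{2}\alpha \right) (\alpha\pi)^{k/2}}\, |\lambda|^{1/2}$
$E[x] = \mu$,   $V[x] = \lambda^{-1} (\alpha - 2)^{-1} \alpha$


$\mathrm{Wi}_k(x \mid \alpha, \beta)$   Wishart (p. 138)

$2\alpha > k - 1$;   $x$ symmetric positive-definite
$\beta$ symmetric non-singular
$p(x) = c\, |x|^{\alpha - (k + 1)/2} \exp\{ -\mathrm{tr}(\beta x) \}$,   $c = \dfrac{\pi^{-k(k - 1)/4}\, |\beta|^{\alpha}}{\prod_{l=1}^{k} \Gamma\left( \frac{1}{2}(2\alpha + 1 - l) \right)}$
$E[x] = \alpha\beta^{-1}$,   $E[x^{-1}] = \left( \alpha - \tfrac{k + 1}{2} \right)^{-1} \beta$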


A.2. INFERENTIAL PROCESSES


The second section of this Appendix records the basic elements of the Bayesian 
learning processes for many commonly used statistical models. 

For each of these models, we provide, in separate sections of the table, the 
following: the sufficient statistic and its sampling distribution; the conjugate fam- 
ily, the conjugate prior predictives for a single observable and for the sufficient 
statistic, the conjugate posterior and the conjugate posterior predictive for a single 
observable. 

When clearly defined, we also provide, in a final section, the reference prior 
and the corresponding reference posterior and posterior predictive for a single 
observable. In the case of uniparameter models this can always be done. We recall,
however, from Section 5.2.4 that, in multiparameter problems, the reference prior
is only defined relative to an ordered parametrisation. In the univariate normal
model (Example 5.17), the reference prior for $(\mu, \lambda)$ happens to be the same as that
for $(\lambda, \mu)$, namely $\pi(\mu, \lambda) = \pi(\lambda, \mu) \propto \lambda^{-1}$, and we provide the corresponding
reference posteriors for $\mu$ and $\lambda$, together with the reference predictive distribution
for a future observation.

In the multinomial, multivariate normal and linear regression models, how- 
ever, there are very many different reference priors, corresponding to different 
inference problems, and specified by different ordered parametrisations. These are 
not reproduced in this Appendix. 


Bernoulli model 


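$z = \{x_1, \ldots, x_n\}$,   $x_i \in \{0, 1\}$
$p(x_i \mid \theta) = \mathrm{Br}(x_i \mid \theta)$,   $0 < \theta < 1$

$t(z) = r = \sum_{i=1}^{n} x_i$
$p(r \mid \theta) = \mathrm{Bi}(r \mid \theta, n)$

$p(\theta) = \mathrm{Be}(\theta \mid \alpha, \beta)$
$p(x) = \mathrm{Bb}(x \mid \alpha, \beta, 1)$
$p(r) = \mathrm{Bb}(r \mid \alpha, \beta, n)$
$p(\theta \mid z) = \mathrm{Be}(\theta \mid \alpha + r, \beta + n - r)$
$p(x \mid z) = \mathrm{Bb}(x \mid \alpha + r, \beta + n - r, 1)$

$\pi(\theta) = \mathrm{Be}(\theta \mid \frac{1}{2}, \frac{1}{2})$
$\pi(\theta \mid z) = \mathrm{Be}(\theta \mid \frac{1}{2} + r, \frac{1}{2} + n - r)$
$\pi(x \mid z) = \mathrm{Bb}(x \mid \frac{1}{2} + r, \frac{1}{2} + n - r, 1)$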
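
The Bernoulli entries translate directly into a few lines of arithmetic. The sketch below is illustrative only (the function name is hypothetical): it returns the parameters of the posterior $\mathrm{Be}(\theta \mid \alpha + r, \beta + n - r)$ and the predictive probability that the next observation is a success, which for the Binomial-Beta form with a single trial reduces to $(\alpha + r)/(\alpha + \beta + n)$. The default $\alpha = \beta = \frac{1}{2}$ corresponds to the reference prior.

```python
def bernoulli_update(z, alpha=0.5, beta=0.5):
    """Conjugate update for the Bernoulli model.
    Returns (posterior alpha, posterior beta, P(next x = 1 | z))."""
    n, r = len(z), sum(z)            # sufficient statistic r = sum of x_i
    post_alpha = alpha + r
    post_beta = beta + n - r
    prob_next_success = post_alpha / (post_alpha + post_beta)
    return post_alpha, post_beta, prob_next_success

print(bernoulli_update([1, 0, 1, 1]))            # reference prior
print(bernoulli_update([1, 0, 1, 1], 2.0, 2.0))  # a conjugate Be(2, 2) prior
```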
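

Poisson Model


$z = \{x_1, \ldots, x_n\}$,   $x_i = 0, 1, 2, \ldots$
$p(x_i \mid \lambda) = \mathrm{Pn}(x_i \mid \lambda)$,   $\lambda \geq 0$

$t(z) = r = \sum_{i=1}^{n} x_i$
$p(r \mid \lambda) = \mathrm{Pn}(r \mid n\lambda)$

$p(\lambda) = \mathrm{Ga}(\lambda \mid \alpha, \beta)$
$p(x) = \mathrm{Pg}(x \mid \alpha, \beta, 1)$
$p(r) = \mathrm{Pg}(r \mid \alpha, \beta, n)$
$p(\lambda \mid z) = \mathrm{Ga}(\lambda \mid \alpha + r, \beta + n)$
$p(x \mid z) = \mathrm{Pg}(x \mid \alpha + r, \beta + n, 1)$

$\pi(\lambda) \propto \lambda^{-1/2}$
$\pi(\lambda \mid z) = \mathrm{Ga}(\lambda \mid r + \frac{1}{2}, n)$
$\pi(x \mid z) = \mathrm{Pg}(x \mid r + \frac{1}{2}, n, 1)$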


Negative-Binomial model 


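$z = (x_1, \ldots, x_n)$,   $x_i = 0, 1, 2, \ldots$
$p(x_i \mid \theta) = \mathrm{Nb}(x_i \mid \theta, r)$,   $0 < \theta < 1$

$t(z) = s = \sum_{i=1}^{n} x_i$
$p(s \mid \theta) = \mathrm{Nb}(s \mid \theta, nr)$

$p(\theta) = \mathrm{Be}(\theta \mid \alpha, \beta)$
$p(x) = \mathrm{Nbb}(x \mid \alpha, \beta, r)$
$p(s) = \mathrm{Nbb}(s \mid \alpha, \beta, nr)$
$p(\theta \mid z) = \mathrm{Be}(\theta \mid \alpha + nr, \beta + s)$
$p(x \mid z) = \mathrm{Nbb}(x \mid \alpha + nr, \beta + s, r)$

$\pi(\theta) \propto \theta^{-1} (1 - \theta)^{-1/2}$
$\pi(\theta \mid z) = \mathrm{Be}(\theta \mid nr, s + \frac{1}{2})$
$\pi(x \mid z) = \mathrm{Nbb}(x \mid nr, s + \frac{1}{2}, r)$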
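

Exponential Model


$z = \{x_1, \ldots, x_n\}$,   $0 < x_i < \infty$
$p(x_i \mid \theta) = \mathrm{Ex}(x_i \mid \theta)$,   $\theta > 0$

$t(z) = t = \sum_{i=1}^{n} x_i$
$p(t \mid \theta) = \mathrm{Ga}(t \mid n, \theta)$

$p(\theta) = \mathrm{Ga}(\theta \mid \alpha, \beta)$
$p(x) = \mathrm{Gg}(x \mid \alpha, \beta, 1)$
$p(t) = \mathrm{Gg}(t \mid \alpha, \beta, n)$
$p(\theta \mid z) = \mathrm{Ga}(\theta \mid \alpha + n, \beta + t)$
$p(x \mid z) = \mathrm{Gg}(x \mid \alpha + n, \beta + t, 1)$

$\pi(\theta) \propto \theta^{-1}$
$\pi(\theta \mid z) = \mathrm{Ga}(\theta \mid n, t)$
$\pi(x \mid z) = \mathrm{Gg}(x \mid n, t, 1)$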

Uniform Model 


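$z = \{x_1, \ldots, x_n\}$,   $0 < x_i < \theta$
$p(x_i \mid \theta) = \mathrm{Un}(x_i \mid 0, \theta)$,   $\theta > 0$

$t(z) = t = \max\{x_1, \ldots, x_n\}$
$p(t \mid \theta) = \mathrm{Ip}(t \mid n, \theta^{-1})$

$p(\theta) = \mathrm{Pa}(\theta \mid \alpha, \beta)$
$p(x) = \dfrac{\alpha}{\alpha + 1}\, \mathrm{Un}(x \mid 0, \beta)$ if $x \leq \beta$;   $\dfrac{1}{\alpha + 1}\, \mathrm{Pa}(x \mid \alpha, \beta)$ if $x > \beta$
$p(t) = \dfrac{\alpha}{\alpha + n}\, \mathrm{Ip}(t \mid n, \beta^{-1})$ if $t \leq \beta$;   $\dfrac{n}{\alpha + n}\, \mathrm{Pa}(t \mid \alpha, \beta)$ if $t > \beta$
$p(\theta \mid z) = \mathrm{Pa}(\theta \mid \alpha + n, \beta_n)$,   $\beta_n = \max\{\beta, t\}$
$p(x \mid z) = \dfrac{\alpha + n}{\alpha + n + 1}\, \mathrm{Un}(x \mid 0, \beta_n)$ if $x \leq \beta_n$;   $\dfrac{1}{\alpha + n + 1}\, \mathrm{Pa}(x \mid \alpha + n, \beta_n)$ if $x > \beta_n$

$\pi(\theta) \propto \theta^{-1}$
$\pi(\theta \mid z) = \mathrm{Pa}(\theta \mid n, t)$
$\pi(x \mid z) = \dfrac{n}{n + 1}\, \mathrm{Un}(x \mid 0, t)$ if $x \leq t$;   $\dfrac{1}{n + 1}\, \mathrm{Pa}(x \mid n, t)$ if $x > t$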
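

Normal Model (known precision $\lambda$)


$z = \{x_1, \ldots, x_n\}$,   $-\infty < x_i < \infty$
$p(x_i \mid \mu, \lambda) = \mathrm{N}(x_i \mid \mu, \lambda)$,   $-\infty < \mu < \infty$

$t(z) = \bar{x} = n^{-1} \sum_{i=1}^{n} x_i$
$p(\bar{x} \mid \mu, \lambda) = \mathrm{N}(\bar{x} \mid \mu, n\lambda)$

$p(\mu) = \mathrm{N}(\mu \mid \mu_0, \lambda_0)$
$p(x) = \mathrm{N}(x \mid \mu_0, \lambda\lambda_0 (\lambda_0 + \lambda)^{-1})$
$p(\bar{x}) = \mathrm{N}(\bar{x} \mid \mu_0, n\lambda\lambda_0 \lambda_n^{-1})$,   $\lambda_n = \lambda_0 + n\lambda$
$p(\mu \mid z) = \mathrm{N}(\mu \mid \mu_n, \lambda_n)$,   $\mu_n = \lambda_n^{-1} (\lambda_0 \mu_0 + n\lambda\bar{x})$
$p(x \mid z) = \mathrm{N}(x \mid \mu_n, \lambda\lambda_n (\lambda_n + \lambda)^{-1})$

$\pi(\mu) = $ constant
$\pi(\mu \mid z) = \mathrm{N}(\mu \mid \bar{x}, n\lambda)$
$\pi(x \mid z) = \mathrm{N}(x \mid \bar{x}, \lambda n (n + 1)^{-1})$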

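Normal Model (known mean $\mu$)


$z = \{x_1, \ldots, x_n\}$,   $-\infty < x_i < \infty$
$p(x_i \mid \mu, \lambda) = \mathrm{N}(x_i \mid \mu, \lambda)$,   $\lambda > 0$

$t(z) = t = \sum_{i=1}^{n} (x_i - \mu)^2$
$p(t \mid \lambda) = \mathrm{Ga}(t \mid \frac{1}{2} n, \frac{1}{2} \lambda)$,   $p(\lambda t) = \chi^2(\lambda t \mid n)$

$p(\lambda) = \mathrm{Ga}(\lambda \mid \alpha, \beta)$
$p(x) = \mathrm{St}(x \mid \mu, \alpha\beta^{-1}, 2\alpha)$
$p(t) = \mathrm{Gg}(t \mid \alpha, 2\beta, \frac{1}{2} n)$
$p(\lambda \mid z) = \mathrm{Ga}(\lambda \mid \alpha + \frac{1}{2} n, \beta + \frac{1}{2} t)$
$p(x \mid z) = \mathrm{St}(x \mid \mu, (\alpha + \frac{1}{2} n)(\beta + \frac{1}{2} t)^{-1}, 2\alpha + n)$

$\pi(\lambda) \propto \lambda^{-1}$
$\pi(\lambda \mid z) = \mathrm{Ga}(\lambda \mid \frac{1}{2} n, \frac{1}{2} t)$
$\pi(x \mid z) = \mathrm{St}(x \mid \mu, n t^{-1}, n)$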



Normal Model (both parameters unknown) 


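$z = \{x_1, \ldots, x_n\}$,   $-\infty < x_i < \infty$
$p(x_i \mid \mu, \lambda) = \mathrm{N}(x_i \mid \mu, \lambda)$,   $-\infty < \mu < \infty$, $\lambda > 0$

$t(z) = (\bar{x}, s)$,   $n\bar{x} = \sum_{i=1}^{n} x_i$,   $ns^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$
$p(\bar{x} \mid \mu, \lambda) = \mathrm{N}(\bar{x} \mid \mu, n\lambda)$
$p(ns^2 \mid \mu, \lambda) = \mathrm{Ga}(ns^2 \mid \frac{1}{2}(n - 1), \frac{1}{2} \lambda)$,   $p(\lambda ns^2) = \chi^2(\lambda ns^2 \mid n - 1)$

$p(\mu, \lambda) = \mathrm{Ng}(\mu, \lambda \mid \mu_0, n_0, \alpha, \beta) = \mathrm{N}(\mu \mid \mu_0, n_0 \lambda)\, \mathrm{Ga}(\lambda \mid \alpha, \beta)$
$p(\mu) = \mathrm{St}(\mu \mid \mu_0, n_0 \alpha\beta^{-1}, 2\alpha)$
$p(\lambda) = \mathrm{Ga}(\lambda \mid \alpha, \beta)$
$p(x) = \mathrm{St}(x \mid \mu_0, n_0 (n_0 + 1)^{-1} \alpha\beta^{-1}, 2\alpha)$
$p(\bar{x}) = \mathrm{St}(\bar{x} \mid \mu_0, n_0 n (n_0 + n)^{-1} \alpha\beta^{-1}, 2\alpha)$
$p(ns^2) = \mathrm{Gg}(ns^2 \mid \alpha, 2\beta, \frac{1}{2}(n - 1))$
$p(\mu \mid z) = \mathrm{St}(\mu \mid \mu_n, (n + n_0)(\alpha + \frac{1}{2} n) \beta_n^{-1}, 2\alpha + n)$,
    $\mu_n = (n_0 + n)^{-1} (n_0 \mu_0 + n\bar{x})$,
    $\beta_n = \beta + \frac{1}{2} ns^2 + \frac{1}{2} (n_0 + n)^{-1} n_0 n (\mu_0 - \bar{x})^2$
$p(\lambda \mid z) = \mathrm{Ga}(\lambda \mid \alpha + \frac{1}{2} n, \beta_n)$
$p(x \mid z) = \mathrm{St}(x \mid \mu_n, (n + n_0)(n + n_0 + 1)^{-1} (\alpha + \frac{1}{2} n) \beta_n^{-1}, 2\alpha + n)$

$\pi(\mu, \lambda) = \pi(\lambda, \mu) \propto \lambda^{-1}$,   $n > 1$
$\pi(\mu \mid z) = \mathrm{St}(\mu \mid \bar{x}, (n - 1) s^{-2}, n - 1)$
$\pi(\lambda \mid z) = \mathrm{Ga}(\lambda \mid \frac{1}{2}(n - 1), \frac{1}{2} ns^2)$
$\pi(x \mid z) = \mathrm{St}(x \mid \bar{x}, (n - 1)(n + 1)^{-1} s^{-2}, n - 1)$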
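
The conjugate update for the normal model with both parameters unknown amounts to computing $\mu_n$ and $\beta_n$ from the sufficient statistics. The sketch below is an illustration under the parametrisation of this table (precision $\lambda$, prior $\mathrm{Ng}(\mu, \lambda \mid \mu_0, n_0, \alpha, \beta)$); the function name and the numerical inputs are hypothetical. It returns the four parameters of the updated Normal-Gamma form, from which the Student and Gamma marginals above follow.

```python
def normal_gamma_update(z, mu0, n0, alpha, beta):
    """Conjugate update for the normal model, mean and precision unknown.
    Returns (mu_n, n0 + n, alpha + n/2, beta_n)."""
    n = len(z)
    xbar = sum(z) / n
    ns2 = sum((x - xbar) ** 2 for x in z)       # n * s^2
    mu_n = (n0 * mu0 + n * xbar) / (n0 + n)
    beta_n = beta + 0.5 * ns2 + 0.5 * n0 * n * (mu0 - xbar) ** 2 / (n0 + n)
    return mu_n, n0 + n, alpha + 0.5 * n, beta_n

mu_n, n_n, alpha_n, beta_n = normal_gamma_update([1.0, 2.0, 3.0], mu0=0.0, n0=1.0,
                                                 alpha=1.0, beta=1.0)
# Posterior mean of the precision, E[lambda | z] = alpha_n / beta_n.
print(mu_n, n_n, alpha_n, beta_n, alpha_n / beta_n)
```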
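

Multinomial Model


$z = \{r_1, \ldots, r_k, n\}$,   $r_i = 0, 1, 2, \ldots$,   $\sum_{l=1}^{k} r_l \leq n$
$p(z \mid \theta) = \mathrm{Mu}_k(z \mid \theta, n)$,   $0 < \theta_i < 1$,   $\sum_{l=1}^{k} \theta_l \leq 1$

$t(z) = (r, n)$,   $r = (r_1, \ldots, r_k)$
$p(r \mid \theta) = \mathrm{Mu}_k(r \mid \theta, n)$

$p(\theta) = \mathrm{Di}_k(\theta \mid \alpha)$,   $\alpha = \{\alpha_1, \ldots, \alpha_{k+1}\}$
$p(r) = \mathrm{Md}_k(r \mid \alpha, n)$
$p(\theta \mid z) = \mathrm{Di}_k\left( \theta \mid \alpha_1 + r_1, \ldots, \alpha_k + r_k, \alpha_{k+1} + n - \sum_{l=1}^{k} r_l \right)$
$p(x \mid z) = \mathrm{Md}_k\left( x \mid \alpha_1 + r_1, \ldots, \alpha_k + r_k, \alpha_{k+1} + n - \sum_{l=1}^{k} r_l, n \right)$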

Multivariate Normal Model 


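$z = \{x_1, \ldots, x_n\}$,   $x_i \in \Re^k$
$p(x_i \mid \mu, \lambda) = \mathrm{N}_k(x_i \mid \mu, \lambda)$,   $\mu \in \Re^k$,   $\lambda$ $k \times k$ positive-definite

$t(z) = (\bar{x}, S)$,   $\bar{x} = n^{-1} \sum_{i=1}^{n} x_i$,   $S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$
$p(\bar{x} \mid \mu, \lambda) = \mathrm{N}_k(\bar{x} \mid \mu, n\lambda)$
$p(S \mid \lambda) = \mathrm{Wi}_k(S \mid \frac{1}{2}(n - 1), \frac{1}{2} \lambda)$

$p(\mu, \lambda) = \mathrm{Nw}_k(\mu, \lambda \mid \mu_0, n_0, \alpha, \beta) = \mathrm{N}_k(\mu \mid \mu_0, n_0 \lambda)\, \mathrm{Wi}_k(\lambda \mid \alpha, \beta)$
$p(x) = \mathrm{St}_k(x \mid \mu_0, (n_0 + 1)^{-1} n_0 (\alpha - \frac{1}{2}(k - 1)) \beta^{-1}, 2\alpha - k + 1)$
$p(\mu \mid z) = \mathrm{St}_k(\mu \mid \mu_n, (n + n_0) \alpha_n \beta_n^{-1}, 2\alpha_n)$,
    $\mu_n = (n_0 + n)^{-1} (n_0 \mu_0 + n\bar{x})$,
    $\beta_n = \beta + \frac{1}{2} S + \frac{1}{2} (n + n_0)^{-1} n n_0 (\mu_0 - \bar{x})(\mu_0 - \bar{x})'$
$p(\lambda \mid z) = \mathrm{Wi}_k(\lambda \mid \alpha + \frac{1}{2} n, \beta_n)$
$p(x \mid z) = \mathrm{St}_k(x \mid \mu_n, (n_0 + n + 1)^{-1} (n_0 + n) \alpha_n \beta_n^{-1}, 2\alpha_n)$,
    $\alpha_n = \alpha + \frac{1}{2} n - \frac{1}{2}(k - 1)$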
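Linear Regression


$z = (y, X), \quad y = (y_1, \ldots, y_n)^t \in \Re^n, \quad x_i = (x_{i1}, \ldots, x_{ik}) \in \Re^k, \quad X = (x_{ij})$
$p(y \mid X, \theta, \lambda) = \mathrm{N}_n(y \mid X\theta,\ \lambda I_n), \quad \theta \in \Re^k, \quad \lambda > 0$


$t(z) = (X^tX,\ X^ty)$


$p(\theta, \lambda) = \mathrm{Ng}(\theta, \lambda \mid \theta_0, n_0, \alpha, \beta) = \mathrm{N}_k(\theta \mid \theta_0, n_0\lambda)\,\mathrm{Ga}(\lambda \mid \alpha, \beta)$
$p(\theta) = \mathrm{St}_k(\theta \mid \theta_0,\ n_0\alpha\beta^{-1},\ 2\alpha)$
$p(\lambda) = \mathrm{Ga}(\lambda \mid \alpha, \beta)$
$p(y \mid x) = \mathrm{St}(y \mid x\theta_0,\ f(x)\alpha\beta^{-1},\ 2\alpha)$,
$f(x) = 1 - x(x^tx + n_0)^{-1}x^t$,
$p(\theta \mid z) = \mathrm{St}_k(\theta \mid \theta_n,\ (n_0 + X^tX)(\alpha + \tfrac{1}{2}n)\beta_n^{-1},\ 2\alpha + n)$,
$\theta_n = (n_0 + X^tX)^{-1}(n_0\theta_0 + X^ty)$,
$\beta_n = \beta + \tfrac{1}{2}(y - X\theta_n)^t y + \tfrac{1}{2}(\theta_0 - \theta_n)^t n_0\theta_0$
$p(\lambda \mid z) = \mathrm{Ga}(\lambda \mid \alpha + \tfrac{1}{2}n,\ \beta_n)$
$p(y \mid x, z) = \mathrm{St}(y \mid x\theta_n,\ f_n(x)(\alpha + \tfrac{1}{2}n)\beta_n^{-1},\ 2\alpha + n)$,
$f_n(x) = 1 - x(x^tx + n_0 + X^tX)^{-1}x^t$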


7(O,X) = 7(A.0) x AW"? (for all reorderings of the 6;) 
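$\pi(\theta \mid z) = \mathrm{St}_k(\theta \mid \hat{\theta}_n,\ \tfrac{1}{2}X^tX(n - k)\hat{\beta}_n^{-1},\ n - k)$,

$\hat{\theta}_n = (X^tX)^{-1}X^ty$,

$\hat{\beta}_n = \tfrac{1}{2}(y - X\hat{\theta}_n)^t y$
$\pi(\lambda \mid z) = \mathrm{Ga}(\lambda \mid \tfrac{1}{2}(n - k),\ \hat{\beta}_n)$


$\pi(y \mid x, z) = \mathrm{St}(y \mid x\hat{\theta}_n,\ \tfrac{1}{2}f_n(x)(n - k)\hat{\beta}_n^{-1},\ n - k)$,
$f_n(x) = 1 - x(x^tx + X^tX)^{-1}x^t$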
Appendix B


Non-Bayesian 
Theories 


Summary 

A summary is given of a number of non-Bayesian statistical approaches and 
procedures. The main theories reviewed include classical decision theory, fre- 
quentism, likelihood, and fiducial inference. These are illustrated and compared 
and contrasted with Bayesian methods in the stylised contexts of point and in- 
terval estimation, and hypothesis and significance testing. Further issues dis- 
cussed include: conditional and unconditional inferences; nuisance parameters 
and marginalisation; prediction; asymptotics and criteria for model choice.


B.1 OVERVIEW 


Bayesian statistical theory as presented in this book is self-contained and can be 
understood and applied without reference to alternative statistical theories. There 
are, however, two broad reasons why we think it appropriate to give a summary 
overview of our attitude to other theories. 

First, many, if not most, readers will have some previous exposure to “classi- 
cal” statistics, and the material in this Appendix may help them to put the contents 
of this book into perspective. 
Secondly, our own experience has been that some element of comparative analysis
contributes significantly to an appreciation of the attractions of the Bayesian
paradigm in statistics. 

As a preliminary, we recall from Chapter 1 our acknowledgment that Bayesian
analysis takes place in a rather formal framework, and that exploratory data analysis
and graphical displays are often prerequisite, informal activities. It is important,
therefore, to be clear that in this Appendix we are discussing non-Bayesian formal
procedures.

We begin by making explicit some of the key differences between Bayesian 
and non-Bayesian theories: 


(i) As we showed in detail in Chapter 2, Bayesian statistics has an axiomatic 
foundation which guarantees quantitative coherence. Non-Bayesian statistical 
theories typically lack foundational support of this kind and essentially consist 
of a set of recipes which are not necessarily internally consistent. 

(ii) Non-Bayesian theories typically use only a parametric model family of the
form $\{p(x \mid \theta),\ x \in X,\ \theta \in \Theta\}$, ignoring the prior distribution $p(\theta)$. The
implications of this fact are so far reaching that sometimes Bayesian statistics
is simplistically thought of as statistics with the “optional extra” of a prior
distribution. In Chapters 2 and 3, we have argued that the existence of a prior
distribution is a mathematical consequence of the foundational axioms. In
Chapter 4, we stressed that predictive models, typically derived from combining
$p(x \mid \theta)$ and $p(\theta)$, are primary.

(iii) The decision theoretical foundations of Bayesian statistics provide a natural
framework within which specific problems can easily be structured, with solutions
directly tailored to problems. In contrast, most non-Bayesian theories
essentially consist of stylised procedures, such as those for point or interval
estimation, or hypothesis testing, designed to satisfy or optimise an ad hoc
criterion, and often lacking the necessary flexibility to be adaptable to specific
problem situations.

(iv) We have argued that, from a Bayesian viewpoint, a decision structure is the 
natural framework for any formal statistical problem, and have described how 
a “pure” inference problem may be seen as a particular decision problem. 
Non-Bayesian theories depart radically from this viewpoint; classical deci- 
sion theory is only partially relevant to inference, and non-Bayesian inference 
theories typically ignore the decision aspects of inference problems. 


In Section B.2, we will revise the key ideas of a number of non-Bayesian 
statistical theories, specifically reviewing Classical Decision Theory, Frequentist 
Procedures, Likelihood Inference and Fiducial and Related Theories. 
In Section B.3, we will follow the typical methodological partition of non-Bayesian
textbooks into the topics of Point Estimation, Interval Estimation, Hypothesis
Testing and Significance Testing. Within each of those subheadings we
will comment on the internal logic, the relevance to actual statistical problems, and 
the performance of classical procedures relative to their Bayesian counterparts. 

In Section B.4, we will discuss in detail some key comparative issues: Con- 
ditional and Unconditional Inference, Nuisance Parameters and Marginalisation, 
Approaches to Prediction, Aspects of Asymptotics and Model Choice Criteria. 

For readers seeking further comparative discussion at textbook level, we recall, 
from our discussion at the end of Chapter 5, the books by Barnett (1971/1982), Press 
(1972/1982), Cox and Hinkley (1974), Anderson (1984), DeGroot (1987), Casella 
and Berger (1990) and Poirier (1993). 


B.2 ALTERNATIVE APPROACHES
B.2.1 Classical Decision Theory 


We recall from Section 3.3 the basic structure of a general decision problem, consisting
of a set of possible decisions $D$, a parameter space $\Omega$, a prior distribution
$p(\omega)$ over $\Omega$, and a utility function $u(d(\omega))$, which we shall denote by $u(d, \omega)$ to
conform more closely to standard notation in classical decision theory. We established
that the existence of both the prior distribution $p(\omega)$ and the utility function
$u(d, \omega)$ is a mathematical consequence of the axioms of quantitative coherence and
that the best decision $d^*$ is that which maximises the expected utility


$$\bar{u}(d) = \int_\Omega u(d, \omega)\, p(\omega)\, d\omega.$$


We established furthermore that, if additional information $x$ is obtained which
is probabilistically related to $\omega$ by $p(x \mid \omega)$, then the best decision $d^*_x$ is that which
maximises the posterior expected utility


$$\bar{u}(d \mid x) = \int_\Omega u(d, \omega)\, p(\omega \mid x)\, d\omega,$$


where

$$p(\omega \mid x) \propto p(x \mid \omega)\, p(\omega).$$


Some authors prefer to use loss functions instead of utilities. A regret function,
or decision loss, is easily defined from the utility function (at least in bounded cases)
by

$$l(d, \omega) = \sup_{d_i \in D} u(d_i, \omega) - u(d, \omega),$$

which quantifies the maximum loss that, for each $\omega$, one may suffer as a consequence
of a wrong decision. Since $\sup_{d_i \in D} u(d_i, \omega)$ only depends on $\omega$, the expected
loss


$$\bar{l}(d) = \int_\Omega l(d, \omega)\, p(\omega)\, d\omega$$


is minimised by the same decision $d^*$ which maximises $\bar{u}(d)$ and hence, from a
Bayesian point of view, the two formulations are essentially equivalent.

In contrast to this Bayesian formulation, the core framework of classical decision
theory may be loosely described as decision theory without a prior distribution.
A utility function (or a loss function) is accepted, perhaps justified by utility-only
axiomatics of the type pioneered by von Neumann and Morgenstern (1944/1953),
but a prior distribution for $\omega$ is not.

Although some of the basic ideas in classical statistical theory were present 
in the work of Neyman and Pearson (1933), it was Wald (1950) who introduced 
a systematic decision theory framework, excluding prior distributions as core ele- 
ments, but including a formulation of standard statistical problems within a decision 
framework. This work was continued by Girshick and Savage (1951) and by Stein 
(1956). An excellent textbook introduction is that of Ferguson (1967). 

Classical decision theory focuses on the way in which additional information
$x$ should be used to assist the decision process. Consequently, the basic space
is not the class of decisions, $D$, but the class of decision rules, $\Delta$, consisting of
functions $\delta : X \to D$ which attach a decision $\delta(x)$ to each possible data set $x$. It
is then suggested that decision rules should be evaluated in terms of their average
loss with respect to the data which might arise. Thus, the risk function $r(\delta, \omega)$ of a
decision rule $\delta$ is defined as


$$r(\delta, \omega) = \int_X l(\delta(x), \omega)\, p(x \mid \omega)\, dx$$


and subsequent comparison of decision rules is based on their risk functions.
The formulation includes, as a special case, the situation with no additional
data (the no-data case), where the risk function reduces to the loss function.


Example B.1. (Estimation of the mean of a normal distribution).
Let $x = \{x_1, \ldots, x_n\}$ be a random sample from a $\mathrm{N}(x \mid \mu, 1)$ distribution, and suppose that
we want to select an estimator for $\mu$, so that $D = \Re$, under the assumption of a quadratic
loss function $l(d, \mu) = (\mu - d)^2$. Some possible decision rules are
(i) $\delta_1(x) = \bar{x}$, the sample mean
(ii) $\delta_2(x) = \tilde{x}$, the sample median
(iii) $\delta_3(x) = \mu_0$, a fixed value
(iv) $\delta_4(x) = (n_0 + n)^{-1}(n_0\mu_0 + n\bar{x})$, the posterior mean from an $\mathrm{N}(\mu \mid \mu_0, n_0)$ prior,
centred on $\mu_0$ and with precision $n_0$.

Using the fact that the variance of the sample median is approximately $\pi/2n$, the
corresponding risk functions are easily seen to be
(i) $r(\delta_1, \mu) = 1/n$
(ii) $r(\delta_2, \mu) \approx \pi/2n$
(iii) $r(\delta_3, \mu) = (\mu - \mu_0)^2$
(iv) $r(\delta_4, \mu) = (n + n_0)^{-2}\{n + n_0^2(\mu_0 - \mu)^2\}$


Figure B.1 Risk functions for $\mu_0 = 0$, $n = 10$ and $n_0 = 5$


Note that (iv) includes both (i) and (iii) as limiting cases when $n_0 \to 0$ and $n_0 \to \infty$
respectively. Figure B.1 provides a graphical comparison of the risk functions.

It is easily seen that, whatever the value of $\mu$, $\delta_2$ has larger risk than $\delta_1$ but, otherwise,
the best decision rule markedly depends on $\mu$. The closer $\mu_0$ is to the true value of $\mu$, the
more attractive $\delta_3$ and $\delta_4$ will obviously be.
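
The four risk functions of Example B.1 are simple enough to tabulate directly. The following is a minimal sketch, using the closed forms stated in the example with the Figure B.1 settings as defaults in the demonstration; the function names are hypothetical and chosen only for this illustration.

```python
import math


def risk_mean(mu, n):
    """Risk of the sample mean (rule i); it does not depend on mu."""
    return 1.0 / n


def risk_median(mu, n):
    """Approximate risk of the sample median (rule ii)."""
    return math.pi / (2.0 * n)


def risk_fixed(mu, mu0):
    """Risk of the fixed value mu0 (rule iii)."""
    return (mu - mu0) ** 2


def risk_posterior_mean(mu, n, mu0, n0):
    """Risk of the posterior mean under an N(mu | mu0, n0) prior (rule iv)."""
    return (n + n0 ** 2 * (mu0 - mu) ** 2) / (n + n0) ** 2


if __name__ == "__main__":
    n, mu0, n0 = 10, 0.0, 5
    for mu in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(mu,
              risk_mean(mu, n),
              risk_median(mu, n),
              risk_fixed(mu, mu0),
              risk_posterior_mean(mu, n, mu0, n0))
```

Setting `n0 = 0` in `risk_posterior_mean` recovers the risk of the sample mean, and letting `n0` grow without bound approaches the risk of the fixed value, in line with the limiting cases noted in the example.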


Admissibility 


The decision rule $\delta_2$ in Example B.1 can hardly be considered a good decision rule
since, for any value of the unknown parameter $\mu$, the rule $\delta_1$ has a smaller risk.
This is formalised within classical decision theory by saying that a decision rule $\delta'$
is dominated by another decision rule $\delta$ if, for all $\omega$,


$$r(\delta', \omega) \ge r(\delta, \omega)$$

with strict inequality for some $\omega$, and that a decision rule is admissible if there is
no other decision rule which dominates it. A class of decision rules is complete if
for any $\delta'$ not in the class there is a $\delta$ in the class which dominates it, and a class
is minimal complete if it does not contain a complete subclass. If one is to choose
among decision rules in terms of their risk functions, classical decision theory
establishes that one can limit attention to a minimal complete class. However, for
guidance on how to choose among admissible decision rules, further concepts and
criteria are required.


Bayes Rules 


If the existence of a prior distribution $p(\omega)$ for the unknown parameter is accepted,
classical decision theory focuses on the decision rule which minimises expected
risk (or so-called Bayes risk)


$$\min_\delta \int_\Omega r(\delta, \omega)\, p(\omega)\, d\omega = \min_\delta \int_\Omega \int_X l(\delta(x), \omega)\, p(x \mid \omega)\, p(\omega)\, dx\, d\omega,$$


which it calls a Bayes decision rule. Note that, since

$$p(x \mid \omega)\, p(\omega) = p(\omega \mid x)\, p(x),$$

under appropriate regularity conditions we may reverse the order of integration
above to obtain


$$\min_\delta \int_\Omega \int_X l(\delta(x), \omega)\, p(\omega \mid x)\, p(x)\, dx\, d\omega = \int_X p(x) \left\{ \min_d \int_\Omega l(d, \omega)\, p(\omega \mid x)\, d\omega \right\} dx,$$


so that the Bayes rule may be simply described as the decision rule which maps
each data set $x$ to the decision $\delta^*(x)$ which minimises the corresponding posterior
expected loss. Note that this interpretation does not require the evaluation of any
risk function.
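
The equivalence between minimising the Bayes risk over decision rules and minimising the posterior expected loss for each data set can be verified by brute force in a finite problem. The following is a minimal sketch, assuming a hypothetical two-state, two-outcome, two-decision problem whose prior, sampling model and loss table are invented purely for illustration.

```python
from itertools import product

OMEGA = (0, 1)   # parameter values
X = (0, 1)       # possible data sets
D = (0, 1)       # possible decisions

prior = {0: 0.7, 1: 0.3}                               # p(omega), assumed
lik = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}       # lik[omega][x] = p(x | omega), assumed
loss = {(0, 0): 0.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 0.0}  # l(d, omega), assumed


def risk(rule, w):
    """Risk function r(delta, omega): average loss over the data."""
    return sum(loss[(rule[x], w)] * lik[w][x] for x in X)


def bayes_risk(rule):
    """Expected risk of a decision rule with respect to the prior."""
    return sum(risk(rule, w) * prior[w] for w in OMEGA)


def best_rule_by_bayes_risk():
    """Enumerate every decision rule and keep the one with least Bayes risk."""
    rules = [dict(zip(X, ds)) for ds in product(D, repeat=len(X))]
    return min(rules, key=bayes_risk)


def posterior(x):
    """Posterior p(omega | x) from Bayes' theorem."""
    joint = {w: lik[w][x] * prior[w] for w in OMEGA}
    px = sum(joint.values())
    return {w: joint[w] / px for w in OMEGA}


def best_rule_by_posterior_loss():
    """For each x, choose the decision minimising posterior expected loss."""
    rule = {}
    for x in X:
        post = posterior(x)
        rule[x] = min(D, key=lambda d: sum(loss[(d, w)] * post[w] for w in OMEGA))
    return rule


if __name__ == "__main__":
    print(best_rule_by_bayes_risk(), best_rule_by_posterior_loss())
```

The second route never evaluates a risk function, and requires one small minimisation per observed data set rather than a search over all decision rules.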

It is easily shown that any Bayes rule which corresponds to a proper prior
distribution is admissible. Indeed, if $\delta^*$ is the Bayes decision rule which corresponds
to $p(\omega)$ and $\delta'$ were another decision rule such that $r(\delta', \omega) \le r(\delta^*, \omega)$ with strict
inequality on some subset of $\Omega$ with positive probability under $p(\omega)$, then one
would have


$$\int_\Omega r(\delta', \omega)\, p(\omega)\, d\omega < \int_\Omega r(\delta^*, \omega)\, p(\omega)\, d\omega,$$


which would contradict the definition of a Bayes rule as one which minimises
the expected risk. Wald (1950) proved the important converse result that, under
rather general conditions, any admissible decision rule is a Bayes decision rule
with respect to some, possibly improper, prior distribution.

There is, however, no guarantee that improper priors lead to admissible decision
rules. A famous example is the inadmissibility of the sample mean of multivariate
normal data as an estimator of the population mean, even though it is the Bayes
estimator which corresponds to a uniform prior. For details, see James and Stein
(1961).
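
The inadmissibility result can be illustrated by simulation. The following is a minimal Monte Carlo sketch, assuming a single observation $x$ from a $k$-variate normal distribution with identity covariance (which is equivalent, up to scaling, to working with a sample mean), quadratic loss, and the usual shrinkage estimator $(1 - (k - 2)/\|x\|^2)\,x$ towards the origin; the function name and the simulation settings are hypothetical.

```python
import random


def simulate_risks(theta, reps=5000, seed=0):
    """Monte Carlo estimates of the quadratic risk of x and of the shrinkage estimator."""
    rng = random.Random(seed)
    k = len(theta)
    loss_plain = 0.0
    loss_shrunk = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        norm2 = sum(v * v for v in x)
        shrink = 1.0 - (k - 2) / norm2          # shrinkage factor towards the origin
        js = [shrink * v for v in x]
        loss_plain += sum((v - t) ** 2 for v, t in zip(x, theta))
        loss_shrunk += sum((v - t) ** 2 for v, t in zip(js, theta))
    return loss_plain / reps, loss_shrunk / reps


if __name__ == "__main__":
    for centre in (0.0, 2.0):
        print(centre, simulate_risks([centre] * 10))
```

In dimension three or more, the simulated risk of the shrinkage estimator falls below that of the unshrunk observation whatever the value of the mean vector, with the largest gains when the mean is close to the origin.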


Minimax rules 


The combined facts that admissible rules must be Bayes, and that to derive the Bayes 
rule does not require computation of the risk function but simply the minimisation 
of the posterior expected loss, make it clear that, apart from purely mathematical 
interest, it is rather pointless to work in decision theory outside the Bayesian frame- 
work. Indeed, this has been the mainstream view since the early 1960’s, with the 
authoritative monographs by DeGroot (1970) and Berger (1985a) becoming the 
most widely used decision theory texts. 

Nevertheless, some textbooks continue to propose as a criterion for choosing 
among decisions (without using a prior distribution) the rather unappealing minimax 
principle. This asserts that one should choose that decision (or decision rule) for 
which the maximum possible loss (or risk) is minimal. It can be shown, under rather 
general conditions, and certainly in the finite spaces of real world applications, 
that the minimax rule is the Bayes decision rule which corresponds to the least 
favourable prior distribution, i.e., that which gives the highest expected risk. 

The intuitive basis of the minimax principle is that one should guard against the 
largest possible loss. While this may have some value in the context of game theory, 
where a player may expect the opponent to try to put him or her in the worst possible 
situation, it has no obvious intuitive merit in standard decision problems. The idea 
that the minimax rule should be preferred to a rule which has better properties 
for nearly all plausible w values, but has a slightly higher maximum risk for an 
extremely unlikely w value seems absurd. Moreover, even as a formal decision 
criterion, minimax has very unattractive features; for instance, it gives different 
answers if applied to losses rather than to regret functions, and it can violate the 
transitivity of preferences (see e.g., Lindley, 1972). 
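
The dependence of the minimax choice on whether losses or regrets are used is easily seen in a two-decision, two-state table. The following is a minimal sketch; the loss values are invented purely for illustration, and the helper names are hypothetical.

```python
# Hypothetical loss table: loss[d][w] is the loss of decision d in state w.
loss = {
    "d1": {"w1": 0.0, "w2": 10.0},
    "d2": {"w1": 8.0, "w2": 9.0},
}


def regret(table):
    """Subtract, for each state, the smallest loss attainable in that state."""
    states = list(next(iter(table.values())).keys())
    best = {w: min(row[w] for row in table.values()) for w in states}
    return {d: {w: row[w] - best[w] for w in states} for d, row in table.items()}


def minimax(table):
    """Decision whose worst-case entry is smallest."""
    return min(table, key=lambda d: max(table[d].values()))


if __name__ == "__main__":
    print(minimax(loss))          # minimax applied to losses
    print(minimax(regret(loss)))  # minimax applied to regrets
```

With these numbers the minimax decision is `d2` on the loss scale but `d1` on the regret scale, although nothing about the problem has changed except an additive, state-dependent constant.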

Thus, although in specific instances—namely when prior beliefs happen to 
be close to the least favourable distribution— the minimax solution may be reason- 
able (essentially coinciding with the Bayes solution), the minimax criterion seems 
entirely unreasonable. 


B.2.2 Frequentist Procedures 


We recall from Section 5.1 the basic structure of a stylised inference problem, where
inferences about $\theta \in \Theta$ are to be drawn from data $x$, probabilistically related to $\theta$
by the parametric model component $\{p(x \mid \theta),\ \theta \in \Theta\}$.

We established that the existence of a prior distribution $p(\theta)$ is a mathematical
consequence of the axioms of quantitative coherence, and that the required inferential
statement about $\theta$ given $x$ is simply provided by the full posterior distribution


$$p(\theta \mid x) = p(x \mid \theta)\, p(\theta) / p(x),$$


where


$$p(x) = \int_\Theta p(x \mid \theta)\, p(\theta)\, d\theta.$$


Frequentist statistical procedures are mainly distinguished by two related features:
(i) they regard the information provided by the data $x$ as the sole quantifiable
form of relevant probabilistic information and (ii) they use, as a basis for both
the construction and the assessment of statistical procedures, long-run frequency
behaviour under hypothetical repetition of similar circumstances.

Although some of the ideas probably date back to the early 1800’s, most 
of the basic concepts were brought together in the 1930’s from two somewhat 
different perspectives, the work of Neyman and Pearson, being critically opposed 
by Fisher, as reflected in discussions at the time published in the Royal Statistical 
Society journals. Convenient references are Neyman and Pearson (1967) and Fisher 
(1990). See, also, Wald (1947) for specific methods for sequential problems. 

Frequentist procedures make extensive use of the likelihood function


$$\mathrm{lik}(\theta \mid x) = p(x \mid \theta)$$


(or variants thereof), essentially taking the mathematical form of the sampling
distribution of the observed data $x$ and considering it as a function of the unknown
parameter $\theta$. If $z = z(x)$ is a one-to-one transformation of $x$, the likelihood in
terms of the sampling distribution of $z$ becomes (in the above variant)


$$\mathrm{lik}(\theta \mid z) = p(z \mid \theta) = p(x \mid \theta) \left| \frac{\partial x}{\partial z} \right| = \mathrm{lik}(\theta \mid x) \left| \frac{\partial x}{\partial z} \right|,$$


which suggests that meaningful likelihood comparisons should be made in the form
of ratios rather than, say, differences, in order for such comparisons not to depend
on the use of $x$ rather than $z$.

The basic ideas behind frequentist statistics consist of (i) selecting a function
of the data $t = t(x)$, called a statistic, which is related to the parameter $\theta$ in a
convenient way, (ii) deriving the sampling distribution of $t$, i.e., the conditional
distribution $p(t \mid \theta)$, and (iii) measuring the “plausibility” of each possible $\theta$ by
calibrating the observed value of the statistic $t$ against its expected long-run behaviour
given $\theta$, described by $p(t \mid \theta)$. For a specific parameter value $\theta = \theta_0$, if
the observed value of $t$ is well within the area where most of the probability density
of $p(t \mid \theta_0)$ lies, then $\theta_0$ is claimed to be compatible with the data; otherwise it is
said that either $\theta_0$ is not the true value of $\theta$, or a rare event has happened.

Such an approach is clearly far removed from the (to a Bayesian rather intuitively
obvious) view that relevant inferences about $\theta$ should be probability statements
about $\theta$ given the observed data, rather than probability statements about
hypothetical repetitions of the data conditional on (the unknown) $\theta$. This contrast
is highlighted by the following example taken from Jaynes (1976).


Example B.2. (Cauchy observations). Let $x = \{x_1, x_2\}$ consist of two independent
observations from a Cauchy distribution $p(x \mid \theta) = \mathrm{St}(x \mid \theta, 1, 1)$. Common sense (supported
by translational and permutational symmetry arguments) suggests that $\hat{\theta} = (x_1 + x_2)/2$ may
be a sensible estimate of $\theta$. Yet, the sampling distribution of $\hat{\theta}$ is again $\mathrm{St}(\hat{\theta} \mid \theta, 1, 1)$ so that,
according to a naive frequentist, it cannot make any difference whether one uses $x_1$, $x_2$ or
$\hat{\theta}$ to estimate $\theta$. Clearly, there is more to inference than the choice of estimators and their
assessment on the basis of sampling distributions.
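
The claim that the average of two Cauchy observations has the same sampling distribution as a single observation can be checked by simulation. The following is a minimal sketch, assuming an arbitrary true value of the location parameter and comparing how often each estimator falls within one unit of it (for a standard Cauchy distribution this probability is exactly 1/2); the function names and settings are hypothetical.

```python
import math
import random


def cauchy(rng, theta):
    """Draw one Cauchy observation with location theta by inverting the distribution function."""
    return theta + math.tan(math.pi * (rng.random() - 0.5))


def hit_rates(theta=3.0, reps=20000, seed=1):
    """Proportion of repetitions in which x1, and (x1 + x2)/2, lie within 1 of theta."""
    rng = random.Random(seed)
    hits_single = 0
    hits_mean = 0
    for _ in range(reps):
        x1 = cauchy(rng, theta)
        x2 = cauchy(rng, theta)
        hits_single += abs(x1 - theta) <= 1.0
        hits_mean += abs((x1 + x2) / 2.0 - theta) <= 1.0
    return hits_single / reps, hits_mean / reps


if __name__ == "__main__":
    print(hit_rates())
```

Both proportions settle near one half, so a criterion based only on the sampling distribution of the estimator cannot distinguish between them.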


Sufficiency 

We recall from Section 4.5 that a statistic $t$ is sufficient if $p(x \mid t, \theta) = p(x \mid t)$; i.e., if
the conditional distribution of the data given $t$ is independent of $\theta$ (Proposition 4.11),
and that a necessary and sufficient condition for $t$ to be sufficient for $\theta$ is that the
likelihood function may be factorised as


$$\mathrm{lik}(\theta \mid x) = p(x \mid \theta) = h(t, \theta)\, g(x),$$


in which case, for any prior $p(\theta)$, the posterior distribution of $\theta$ only depends on
$x$ through $t$, i.e., $p(\theta \mid x) = p(\theta \mid t)$. The concept of sufficiency in the presence
of nuisance parameters is controversial; see, for example, Cano et al. (1988) and
references therein.


The sufficiency principle in classical statistics essentially states that, for any
given model $p(x \mid \theta)$ with sufficient statistic $t$, identical conclusions should be
drawn from data $x_1$ and $x_2$ with the same value of $t$. The idea was introduced by
Fisher (1922) and developed mathematically by Halmos and Savage (1949) and 
Bahadur (1954). 

From a Bayesian viewpoint there is obviously nothing new in this “principle”; 
it is a simple mathematical consequence of Bayes’ theorem. 

However, from a “textbook” perspective, other frequentist developments of 
the sufficiency concept have little or no interest from a Bayesian perspective. For 
example: a sufficient statistic $t$ is complete if for all $\theta$ in $\Theta$, $\int h(t)\, p(t \mid \theta)\, dt = 0$
implies that $h(t) = 0$. The property of completeness guarantees the uniqueness of
certain frequentist statistical procedures based on t, but otherwise seems inconse- 
quential. 
Ancillarity

In Section 5.1 we demonstrated how a sufficient statistic $t = t(x)$ may often
be partitioned into two component statistics $t(x) = [a(x), s(x)]$ such that the
sampling distribution of $a(x)$ is independent of $\theta$. We defined such an $a(x)$ to be
an ancillary statistic and showed that, if $a$ is ancillary, then


$$p(\theta \mid x) = p(\theta \mid t) \propto p(s \mid a, \theta)\, p(\theta)$$


so that, in the inferential process described by Bayes’ theorem, it suffices to work
conditionally on the value of the ancillary statistic. For further information, see
Basu (1959).

The conditionality principle in classical statistics states that, whenever there 
is an ancillary statistic a, the conclusions about the parameter should be drawn as 
if a were fixed at its observed value. The apparent need for such a principle in 
frequentist procedures is well illustrated in the following simple example. 


Example B.3. (Conditional versus unconditional arguments). A 0-1 signal comes
from one of two sources $\theta_1$ or $\theta_2$ and there are two receivers $R_1$ and $R_2$ such that


$$p(x = 0 \mid R_1, \theta_1) = p(x = 1 \mid R_1, \theta_2) = 0.9$$


$$p(x = 0 \mid R_2, \theta_1) = p(x = 1 \mid R_2, \theta_2) = 0.2$$


where the receiver is selected at random, with $p(R_1) = 0.99$. If $R_2$ were the receiver and
$x = 1$ were obtained, the conditional likelihood would have been


$$\mathrm{lik}(\theta_1 \mid R_2, x = 1) = 0.8, \qquad \mathrm{lik}(\theta_2 \mid R_2, x = 1) = 0.2,$$


suggesting $\theta_1$ as the true value of $\theta$. On the other hand, the unconditional likelihood given
$x = 1$ would have been


$$\mathrm{lik}(\theta_1 \mid x = 1) = 0.107, \qquad \mathrm{lik}(\theta_2 \mid x = 1) = 0.893,$$


suggesting $\theta_2$ instead. The conflict arises because the latter (unconditional) argument takes
undue account of what might have happened (i.e., $R_1$ might have been the receiver) but did
not.
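
The figures in Example B.3 follow from the stated probabilities by the law of total probability. The following is a minimal sketch that recomputes them; the dictionary layout and function names are hypothetical, while the probabilities are those given in the example.

```python
P_R1 = 0.99  # probability that receiver R1 is selected

# p(x = 1 | receiver, source), obtained from the probabilities in the example
P_X1 = {
    ("R1", "theta1"): 0.1,  # since p(x = 0 | R1, theta1) = 0.9
    ("R1", "theta2"): 0.9,
    ("R2", "theta1"): 0.8,  # since p(x = 0 | R2, theta1) = 0.2
    ("R2", "theta2"): 0.2,
}


def conditional_lik(receiver, theta):
    """Likelihood of theta given x = 1 and the receiver actually used."""
    return P_X1[(receiver, theta)]


def unconditional_lik(theta):
    """Likelihood of theta given x = 1, averaging over the choice of receiver."""
    return P_R1 * P_X1[("R1", theta)] + (1.0 - P_R1) * P_X1[("R2", theta)]


if __name__ == "__main__":
    for theta in ("theta1", "theta2"):
        print(theta, conditional_lik("R2", theta), round(unconditional_lik(theta), 3))
```

The unconditional values are dominated by the receiver that was not used, which is why they point to the opposite source.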


A further example regarding ancillarity is provided by reconsidering Exam- 
ple B.2. The difficulty in this case disappears if one works conditionally on the 
ancillary statistic $|x_1 - x_2|/2$.

These examples serve to underline the obvious appeal of a trivial consequence 
of Bayes’ theorem: namely, that one should always condition inferences on what- 
ever information is available; the conditionality “principle” is just a small ad hoc 
step towards this rather obvious desideratum (which is, in any case, “automatic” in
the Bayesian approach). 

From a frequentist viewpoint, however, the conditionality “principle” is not 
necessarily easy to apply, since ancillary statistics are not readily identified, and are 
not necessarily unique. Moreover, applying the conditionality principle may leave 
the frequentist statistician in an impasse. For example, Basu (1964) noted that if $x$
is uniform on $]\theta, 1 + \theta[$, then the fractional part of $x$ is uniformly distributed on $]0, 1[$
and hence ancillary, but the conditional distribution of $x$ given its fractional part is
a useless one-point distribution! See Basu (1992) for further elegant demonstration
of the difficulties with ancillary statistics in the frequentist approach.


The Repeated Sampling Principle 


A weak version of the repeated sampling principle states that one should not follow 
statistical procedures which, for some possible value of the parameter, would too 
frequently give misleading conclusions in hypothetical repetitions. Although this 
is too vague a formulation on which to base a formal critique, it can be used to 
criticise specific solutions to concrete problems. 

A much stronger version of this “principle”, whose essence is at the heart of 
frequentist statistics, states that statistical procedures have to be assessed by their 
performance in hypothetical repetitions under identical conditions. This implies 
that (i) measures of uncertainty have to be interpreted as long-run hypothetical 
frequencies, that (ii) optimality criteria have to be defined in terms of long-run be- 
haviour under hypothetical repetitions and, that (iii) there are no means of assessing 
any finite-sample realised accuracy of the procedures. 


Example B.4. (Confidence versus HPD intervals). Let $x = \{x_1, \ldots, x_n\}$ be a random
sample from $\mathrm{N}(x \mid \mu, 1)$. It is easily seen that $\bar{x}$ is a sufficient statistic, whose sampling
distribution is $\mathrm{N}(\bar{x} \mid \mu, n)$, a normal distribution centred at the true value of the parameter,
with precision $n$. Since the sampling distribution of $\bar{x}$ concentrates around $\mu$, one might
expect $\bar{x}$ to be close to $\mu$ on the basis of a large number of hypothetical repetitions of the
sample, so that $\bar{x}$ suggests itself as an estimator of $\mu$. Moreover, conditional on $\mu$,


$$P[\bar{x} \in \mu \pm 1.96/\sqrt{n} \mid \mu] = 0.95$$


so that, if we define a statistical procedure to consist of producing the interval $\bar{x} \pm 1.96/\sqrt{n}$
whenever a random sample of size $n$ from $\mathrm{N}(x \mid \mu, 1)$ is obtained, we are producing an
interval which will include the true value of the parameter 95% of the time, in the long run.
Note that this says nothing about the probability that $\mu$ belongs to that interval for any given
sample. In contrast, the superficially similar statement


$$P[\mu \in \bar{x} \pm 1.96/\sqrt{n} \mid \bar{x}] = 0.95$$


which is derived from the reference posterior distribution of $\mu$ given $\bar{x}$, explicitly says that
given $\bar{x}$, the degree of belief is 0.95 that $\mu$ belongs to $\bar{x} \pm 1.96/\sqrt{n}$, and is not concerned at
all with hypothetical repetitions of the experiment.
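
The long-run interpretation of the first statement in Example B.4 can be made concrete by simulation. The following is a minimal sketch, assuming an arbitrary fixed true mean and repeatedly drawing samples of size $n$ from a normal distribution with unit precision; the function name and settings are hypothetical.

```python
import math
import random


def coverage(mu=2.0, n=10, reps=4000, seed=2):
    """Proportion of repeated samples whose interval xbar +/- 1.96/sqrt(n) contains mu."""
    rng = random.Random(seed)
    half_width = 1.96 / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        hits += (xbar - half_width <= mu <= xbar + half_width)
    return hits / reps


if __name__ == "__main__":
    print(coverage())
```

The simulated proportion is close to 0.95, but it describes the procedure over repetitions with $\mu$ held fixed; the posterior statement refers instead to the one interval computed from the sample actually observed.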
Invariance


If a parametric model $p(x \mid \theta)$ is such that two different data sets, $x_1$ and $x_2$,
have the same distribution for every $\theta$, then both the likelihood principle and the
mechanics of Bayes’ theorem imply that one should derive the same conclusions
about $\theta$ from observing $x_1$ as from observing $x_2$.

A more elaborate form of invariance principle involves transformations of 
both the sample and the parameter spaces. Suppose that, with X = 0, for all the 
elements $g$ of a group $G$ of transformations there is a unique transformation $\bar{g}$ such
that $\bar{g}(\Theta) = \Theta$ and $p(x \mid \theta) = p(g(x) \mid \bar{g}(\theta))$. Then the invariance principle would
require the conclusions about $\theta$ drawn from the statistic $t(g(x))$ to be the same
as those drawn from $\bar{g}(t(x))$. For example, in estimating $\theta \in \Re$ from a location
model $p(x \mid \theta) = h(x - \theta)$ it may be natural to consider the group of translations.
In this case, $g(x) = x + a$, $a \in \Re$, and $\bar{g}(\theta) = \theta + a$. The invariance principle then
requires that any estimate $t(x)$ of $\theta$ should satisfy $t(x + a\mathbf{1}) = t(x) + a$, where $\mathbf{1}$
is a vector of unities.

Note that the argument only works if there is no reason to believe that some $\theta$
values are more likely than others. From a Bayesian point of view, for invariance to
be a relevant notion it must be true that the transformation involved also applies to the
prior distribution (otherwise, one may have a uniform loss of expected utility from
following the invariance principle). Another limitation to the practical usefulness
of invariance ideas is the condition that $\bar{g}(\Theta) = \Theta$. Thus, in the location/translation
example, the invariance principle could not be applied if it were known that $\theta > 0$.

A final general comment. Frequentist procedures centre their attention on 
producing inference statements about unobservable parameters. As we shall see in 
Section B.4, such an approach typically fails to produce a sensible solution to the 
more fundamental problem of predicting future observations. 


B.2.3 Likelihood Inference


We recall from Section 5.1 the following trivial consequence of Bayes’ theorem.
Consider two experiments yielding, respectively, data $x$ and $z$ and with model
representation involving the same parameter $\theta \in \Theta$, the same prior, and proportional
likelihoods, so that

$$p(x \mid \theta) = h(x, z)\, p(z \mid \theta).$$


Then the experiments produce the same conclusions about $\theta$, since they induce
the same posterior distribution. The likelihood principle suggests that this should
indeed be the case, for the relative support given by the two sets of data to the
possible values of $\theta$ is precisely the same.

Frequentist procedures typically violate the likelihood principle, since long
run behaviour under hypothetical repetitions depends on the entire distribution
$\{p(x \mid \theta),\ x \in X\}$ and not only on the likelihood.
As mentioned before, when common priors are used across models with proportional
likelihoods, the Bayesian approach automatically obeys the likelihood
principle and certainly accepts the likelihood function as a complete summary of 
the information provided by the data about the parameter of interest. With a uniform 
prior, the posterior distribution is, of course, proportional to the likelihood function. 
Proponents of the likelihood approach to inference go further, however, in their uses 
of the likelihood function, in that they regard it not only as the sole expression of 
the relevant information, but also as a meaningful relative numerical measure of 
support for different possible models, or for alternative parameter values within the 
same model. The basic ideas of this pure likelihood approach were established by 
Barnard (1949, 1963), Barnard et al. (1962), Birnbaum (1962, 1968, 1972) and Ed- 
wards (1972/1992). They essentially argue that (i) the likelihood function conveys 
all the information provided by a set of data about the relative plausibility of any 
two different possible values of $\theta$ and (ii) the ratio of the likelihood at two different
$\theta$ values may be interpreted as a measure of the strength of evidence in favour of
one value relative to the other. 

Both claims make sense from a Bayesian point of view when there are no 
nuisance parameters. Indeed, (i) is just a restatement of the likelihood principle 
and, moreover, it follows from Bayes’ theorem that 

$$\frac{p(\theta_1 \mid x)}{p(\theta_2 \mid x)} = \frac{p(x \mid \theta_1)}{p(x \mid \theta_2)}\, \frac{p(\theta_1)}{p(\theta_2)},$$


so that the likelihood ratio satisfies (ii), since it is the factor which modifies prior 
odds into posterior odds. 

However, the pure likelihood approach, i.e., the attempt to produce inferences 
solely based on the likelihood function, breaks down immediately when there are 
nuisance parameters. The use of “marginal likelihoods” necessarily requires the 
elimination of nuisance parameters, but the suggested procedures for doing this 
seem hard to justify in terms of the likelihood approach. For early attempts, see 
Kalbfleisch and Sprott (1970, 1973) and Andersen (1970, 1973). In recent years,
work has focused on the properties of profile likelihood and its variants. Useful ref- 
erences include: Barnard and Sprott (1968), Barndorff-Nielsen (1980, 1983, 1991), 
Butler (1986), Davison (1986), Cox and Reid (1987, 1992), Cox (1988), Fraser 
and Reid (1989), Bjørnstad (1990) and Monahan and Boos (1992). Other references
relevant to the interface between likelihood inference and Bayesian statistics in- 
clude Hartigan (1967), Plante (1971), Akaike (1980b), Pereira and Lindley (1987), 
Bickel and Ghosh (1990), Goldstein and Howard (1991) and Royall (1992). See, 
also, Section 5.5.1 for a link with Laplace approximations of posterior densities. 

For further information on the history of likelihood, see Edwards (1974). 

The likelihood approach can also conflict with the weak repeated sampling 
principle, in that examples exist where, for some possible parameter values, hypo- 
thetical repetitions result in mostly misleading conclusions. Frequentist statistics 
solves the difficulty by comparing the observed likelihood function with the distribution
of the likelihood functions which could have been obtained in hypothetical
repetitions; Bayesian statistics solves the problem by working, not with the likeli- 
hood function, but with the posterior distribution defined as the weighted average 
of the likelihood function with respect to the prior. The following example is due 
to Birnbaum (1969). 


Example B.5. (Naive likelihood versus reference analysis). Consider the model
$\{p(x \mid \theta),\ x \in \{1, 2, \ldots, 100\},\ \theta \in \{0, 1, 2, \ldots, 100\}\}$, where, for
$x = 1, 2, \ldots, 100$,

$$p(x \mid \theta = 0) = 1/100, \qquad p(x \mid \theta \neq 0) = 1_{\{\theta\}}(x),$$

so that, when $\theta \neq 0$, the observation equals $\theta$ with probability one.

Then, whatever $x$ is observed, if $\theta = 0$ the likelihood of the true value is always
1/100th of the likelihood of the only other possible $\theta$ value, namely $\theta = x$.

From a Bayesian point of view, the answer obviously depends on the prior distribution.
If all $\theta$ are judged a priori to have the same probability, then one certainly has

$$p(\theta = 0 \mid x) = 1/101, \qquad p(\theta = x \mid x) = 100/101, \quad x \neq 0.$$

However, if, say, $\theta = 0$ is considered to be special, as might well be the case in any real
application of such a model, and is declared to be the parameter of interest, then the reference
prior turns out to be

$$p(\theta = 0) = 1/2, \qquad p(\theta = x) = 1/200, \quad x = 1, \ldots, 100,$$

and a straightforward calculation reveals that the posterior probability of $\theta = 0$, given a
single observation $x$, is again $1/2$. Thus, with this prior, one observation from the model
provides no information about the parameter of interest. Of course, for any prior, a second
observation would, with high probability, reveal the true value of $\theta$.
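
The two posterior calculations in this example can be checked by direct enumeration. The
following is only a minimal sketch: the function names and the illustrative observation
`x_obs = 37` are assumptions introduced here, not part of the original example.

```python
from fractions import Fraction


def likelihood(x, theta):
    """p(x | theta) for the model of Example B.5 (x in 1..100, theta in 0..100)."""
    if theta == 0:
        return Fraction(1, 100)
    return Fraction(1) if x == theta else Fraction(0)


def posterior(x, prior):
    """Bayes' theorem on the finite parameter space {0, 1, ..., 100}."""
    joint = {theta: prior[theta] * likelihood(x, theta) for theta in prior}
    total = sum(joint.values())
    return {theta: weight / total for theta, weight in joint.items()}


# Uniform prior over the 101 possible parameter values.
uniform_prior = {theta: Fraction(1, 101) for theta in range(101)}
# Reference prior quoted in the example, with theta = 0 as the value of interest.
reference_prior = {0: Fraction(1, 2)}
reference_prior.update({theta: Fraction(1, 200) for theta in range(1, 101)})

x_obs = 37  # arbitrary illustrative observation (assumption)
post_uniform = posterior(x_obs, uniform_prior)
post_reference = posterior(x_obs, reference_prior)

print(post_uniform[0], post_uniform[x_obs])      # 1/101 100/101
print(post_reference[0], post_reference[x_obs])  # 1/2 1/2
```

Under the reference prior the posterior probability of the special value 0 stays at 1/2
whatever single observation is made, which is the sense in which one observation is
uninformative.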


Finally, as we shall discuss further in Section B.4, we note that, like frequentist 
procedures, the likelihood approach has difficulties in producing an agreed solution 
to prediction problems. 


B.2.4 Fiducial and Related Theories 


We noted in Section B.2.3 that frequentist approaches are inherently unable to pro- 
duce probability statements about the parameter of interest conditional on the data, 
a form of inference summary that seems most intuitively useful. This fact, coupled 
with the seeming aversion of most statisticians to the use of prior distributions,
has led to a number of attempts to produce “posterior” distributions without using 
priors. We now review some of those proposals. 


Fiducial Inference


In a series of papers published in the thirties, Fisher (1930, 1933, 1935, 1939) developed,
through a series of examples and without any formal structure or theory,
what he termed the fiducial argument. Essentially, he proposed using the distribution
function $F(t \mid \theta)$ of a sufficient estimator $t \in T$ for $\theta \in \Theta$ in order to make
conditional probability statements about $\theta$ given $t$, thus somehow transferring the
probability measure from $T$ to $\Theta$. However, no formal justification was offered for
this controversial “transfer”.

The basic characteristics of the argument may be described as follows. Let
$p(x \mid \theta)$, $\theta \in (\theta_0, \theta_1) \subseteq \mathbb{R}$, be a one-dimensional parametric model and
let $t = t(x)$ be a sufficient statistic for $\theta$. Suppose further that the distribution function
of $t$, $F(t \mid \theta)$, is monotone decreasing in $\theta$, with $F(t \mid \theta_0) = 1$ and $F(t \mid \theta_1) = 0$.
Then, $G(\theta \mid t) = 1 - F(t \mid \theta)$ has the mathematical properties of a distribution function
over $(\theta_0, \theta_1)$ and, hence,

$$f(\theta \mid t) = -\frac{\partial}{\partial \theta} F(t \mid \theta)$$

has the mathematical structure of a “posterior density” for $\theta$. This is the fiducial
distribution of $\theta$, as proposed by Fisher (1930, 1956/1973). The argument is trivially
modified if $F(t \mid \theta)$ is monotone increasing in $\theta$, by using $G(\theta \mid t) = F(t \mid \theta)$.


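Example B.6. (Fiducial and reference distributions). Let $x = \{x_1, \ldots, x_n\}$ be a
random sample from an exponential distribution $p(x \mid \theta) = \mathrm{Ex}(x \mid \theta) = \theta e^{-\theta x}$, with
mean $\theta^{-1}$. It is easily verified that $\bar{x}$ is a sufficient statistic for $\theta$, and has a
distribution function

$$F(\bar{x} \mid \theta) = \frac{1}{\Gamma(n)} \int_0^{n \theta \bar{x}} w^{n-1} e^{-w} \, dw,$$

which is monotone increasing in $\theta$. Hence, $G(\theta \mid \bar{x}) = F(\bar{x} \mid \theta)$ is monotone increasing
from 0 to 1 as $\theta$ ranges over $(0, \infty)$, and the fiducial distribution of $\theta$ is obtained as

$$f(\theta \mid \bar{x}) = \frac{\partial}{\partial \theta} F(\bar{x} \mid \theta) = \frac{(n \bar{x})^n}{\Gamma(n)} \, \theta^{n-1} e^{-n \bar{x} \theta}.$$

Note that this has the form $f(\theta \mid \bar{x}) \propto p(x \mid \theta) \pi(\theta)$, with $\pi(\theta) = \theta^{-1}$. Since
$\pi(\theta) = \theta^{-1}$ is in this case the reference prior for $\theta$, it follows that, in this example,
the fiducial distribution coincides with the reference posterior distribution.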
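
A numerical check of this example may be sketched as follows, assuming an integer sample
size so that the distribution function of the sample mean has a closed form. The values
`n = 5` and `xbar = 2.0`, and all function names, are hypothetical choices made for
illustration only.

```python
import math


def cdf_xbar(xbar, theta, n):
    """F(xbar | theta) for the mean of n exponential observations (integer n).

    Uses the closed form 1 - exp(-y) * sum_{k<n} y^k / k!, with y = n * theta * xbar.
    """
    y = n * theta * xbar
    tail = sum(y**k / math.factorial(k) for k in range(n))
    return 1.0 - math.exp(-y) * tail


def fiducial_density(theta, xbar, n):
    """Closed form (n xbar)^n / Gamma(n) * theta^(n-1) * exp(-n xbar theta)."""
    return (n * xbar) ** n / math.gamma(n) * theta ** (n - 1) * math.exp(-n * xbar * theta)


def numerical_derivative(theta, xbar, n, h=1e-6):
    """Central-difference approximation to dF(xbar | theta) / dtheta."""
    return (cdf_xbar(xbar, theta + h, n) - cdf_xbar(xbar, theta - h, n)) / (2 * h)


def reference_posterior_kernel(theta, xbar, n):
    """Likelihood theta^n exp(-theta n xbar) times the prior 1 / theta (unnormalised)."""
    return theta**n * math.exp(-theta * n * xbar) / theta


n, xbar = 5, 2.0  # hypothetical sample size and observed mean
for theta in (0.2, 0.5, 1.0):
    derivative = numerical_derivative(theta, xbar, n)
    density = fiducial_density(theta, xbar, n)
    ratio = density / reference_posterior_kernel(theta, xbar, n)
    print(theta, derivative, density, ratio)
```

The derivative and the closed form agree to numerical precision, and the ratio in the last
column is the same for every value of the parameter, as expected when the fiducial density
is proportional to likelihood times the prior.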


This last example suggests that the fiducial argument might simply be a re-expression
of Bayesian inference with some appropriately chosen “non-informative” prior.
However, Lindley (1958) established that this is true if, and only if,
the probability model $p(x \mid \theta)$ is such that $x$ and $\theta$ may separately be transformed
so as to obtain a new parameter which is a location parameter for the transformed
variable. See Seidenfeld (1992) for further discussion.

In one-dimensional problems, the fiducial argument, when applicable, is more 
or less well defined and often produces reasonable answers, which are nevertheless 
far better justified from a Bayesian reference analysis viewpoint. However, it is 
by no means clear—and, in fact, a matter of considerable controversy —how the 
argument might be extended to multiparameter problems. The Royal Statistical 
Society discussions following the papers by Fieller (1954) and Wilkinson (1977) 
serve to illustrate the difficulties. Other relevant references are Brillinger (1962) 
and Barnard (1963). 

From a modern perspective, the fiducial argument seems now to have at most 
historical interest, and that mainly due to the perceived stature of its proponent. As 
Good (1971) puts it 


... if we do not examine the fiducial argument carefully, it seems almost incon- 
ceivable that Fisher should have made the error which he did in fact make. It is 
because (i) it seemed so unlikely that a man of his stature should persist in the 
error, and (ii) because, as he modestly says, his 1930 ‘explanation left a good
deal to be desired’, that so many people assumed for so long that the argument
was correct. They lacked the daring to question it. 


See, however, Efron (1993) for a recent suggested modification of the fiducial 
distribution which may have better Bayesian properties. 


Pivotal Inference 


Suppose that, for a given model $p(x \mid \theta)$, with sufficient statistic $t$, it is possible
to find some function $h(\theta, t)$ which is monotone increasing in $\theta$ for fixed $t$, and
in $t$ for fixed $\theta$, and which has a distribution which only depends on $\theta$ through
$h(\theta, t)$. Then, $h(\theta, t)$ is called a pivotal function and the fiducial distribution of
$\theta$ may simply be obtained by reinterpreting the probability distribution of $h$ over
$T$ as a probability distribution over $\Theta$. Fisher's original argument, as described
above, is a special case of this formulation, since $G(\theta \mid t)$ is a pivotal function with
a uniform distribution over $[0, 1]$, which is independent of $\theta$.

Barnard (1980b) has tried to extend this idea into a general approach to infer- 
ence. His basic idea is to produce statements derived from the distribution of an 
appropriately chosen pivotal function, possibly conditional on the observed values 
of an ancillary statistic a(x). 

Partitioning a pivotal function $h(\theta, x) = [g(\theta, x), a(x)]$ to identify a possibly
uniquely defined ancillary statistic $a(x)$, and using the distribution of $g(\theta, x)$
conditional on the observed value of $a(x)$, does produce some interesting results
in multiparameter problems where the standard fiducial argument fails. However,
the mechanism by which the probability measure is transferred from the sample
space to the parameter space remains without foundational justification, and the
argument is limited to the availability (by no means obvious) of an appropriate
pivotal function for the envisaged problem.


Structural Inference 


Yet another attempt at justifying the transfer of the probability measure from the 
sample space into the parameter space is the structural approach proposed by Fraser 
(1968, 1972, 1979). 

Fraser claimed that one often knows more about the relationship between data
and parameters than that described by the standard parametric model $p(x \mid \theta)$. He
proposes the specification of what he terms a structural model, having two parts:
a structural equation, which relates data $x$ and parameter $\theta$ to some error variable
$e$; and a probability distribution for $e$ which is assumed known, and independent
of $\theta$. Thus, the observed variable $x$ is seen as a transformation of the error $e$, the
transformation governed by the value of $\theta$. The key idea is then to reverse this
relationship, and to interpret $\theta$ as a transformation of $e$ governed by the observed
$x$, so that $\theta$ in a sense “inherits” the probability distribution.


Example B.7. (Structural and reference distributions). Let $x = \{x_1, \ldots, x_n\}$ be a
set of independent measurements with unknown location $\mu$ and scale $\sigma$. If the errors have a
known distribution $p(e)$, the structural equation is

$$x_i = \mu + \sigma e_i, \quad i = 1, \ldots, n,$$

and the error distribution is

$$p(e) = \prod_{i=1}^{n} p(e_i).$$

If $p(e)$ is normal, this structural model may be reduced in terms of the sufficient statistics
$\bar{x}$ and $s^2$ to the equations

$$\bar{x} = \mu + \sigma \bar{e}, \qquad s = \sigma s_e,$$

and error distributions

$$\bar{e} = z / \sqrt{n}, \quad z \sim N(z \mid 0, 1), \qquad s_e = \{w / (n - 1)\}^{1/2}, \quad w \sim \chi^2_{n-1}.$$

Reversing the probability relationship in the pivotal functions $(n - 1) s^2 / \sigma^2 \sim \chi^2_{n-1}$ and
$\sqrt{n} (\bar{x} - \mu) / s \sim \mathrm{St}(t \mid 0, 1, n - 1)$ leads to structural distributions for $\sigma$ and $\mu$
which, as is often the case, coincide with the corresponding reference posterior distributions.


The general formulation of structural inference generalises the affine group
structure underlying the last example, and considers a structural equation $x = \theta e$,
to be interpreted as the response $x$ generated by some transformation $\theta \in G$ in
a group of transformations $G$, operating on a realised error $e$, with a completely
identified error distribution for $e$. It is then claimed that $\theta^{-1} x$ has the same
probability distribution as $e$ and, hence, this may be used to provide a structural
distribution for $\theta$.

Here, the mechanism by which the probability measure on X is transferred to 
$\Theta$ is certainly well-defined in the presence of the group structure central to Fraser's
argument. However, the group structure is fundamental and the approach seems to 
lack general validity and applicability. As Lindley (1969) puts it 


... Fraser's argument [is] an improvement upon and an extension of Fisher's in the
special case where the group structure is present but [one should be] ... suspicious
of any argument, ..., that only works in some situations, for inference is surely a
whole, and the Poisson distribution [is] not basically different in character from
... the normal.


When the structural argument can be applied it produces answers which
are mathematically closely related to Bayesian posterior distributions with
“non-informative” priors derived from (group) invariance arguments. In fact, in most
examples, the structural distributions are precisely the posterior distributions
obtained by using as priors the right Haar measures associated with the structural
group, which in turn, are special cases of reference posterior distributions (see
Villegas, 1977a, 1981, 1990; Dawid, 1983b, and references therein).


B.3 STYLISED INFERENCE PROBLEMS 


B.3.1 Point Estimation 


Let $\{p(x \mid \theta), \theta \in \Theta\}$ be a fully specified parametric family of models and suppose
that it is desired to calculate from the data $x$ a single value $\hat{\theta}(x) \in \Theta$ representing
the “best estimate” of the unknown parameter $\theta$. This is the so-called point
estimation problem. Note that, in this formulation, the final answer is an element
of $\Theta$, with no explicit recognition of the uncertainty involved. Pragmatically, a
point estimate of $\theta$ may be motivated as being the simplest possible summary of
the inferences to be drawn from $x$ about the value of $\theta$; alternatively, one may
genuinely require a point estimate as the solution to a decision problem; for example,
adjusting a control mechanism, or setting a stock level.

We recall from Section 5.1.5 that, within the Bayesian framework, the problem
of point estimation is naturally described as a decision problem where the set of
possible answers to the inference problem, $\mathcal{A}$, is the parameter space $\Theta$. Formally,
one specifies the loss function $l(a, \theta)$ which describes the decision maker’s preferences
in that context, and chooses as the (Bayes) estimate that value $\theta^*(x)$ which
minimises the posterior expected loss,

$$\int_{\Theta} l(a, \theta) \, p(\theta \mid x) \, d\theta,$$

where

$$p(\theta \mid x) \propto p(x \mid \theta) \, p(\theta).$$

We have seen (Propositions 5.2 and 5.9) that intuitively natural solutions, such as
the mean, mode or median of the posterior distribution of $\theta$, are particular cases
of this formulation for appropriately chosen loss functions. We also note that the
definition of an optimal Bayesian estimator is constructive, in that it identifies a
precise procedure for obtaining the required value.
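
The constructive nature of the Bayes estimate can be illustrated on a discretised
posterior. The sketch below is only indicative: the grid, the posterior proportional to
$\theta^2 (1 - \theta)^5$ and the three loss functions are assumptions chosen for
illustration, and the minimisation is a brute-force search over the grid.

```python
def bayes_estimate(grid, post, loss):
    """Grid value a minimising the posterior expected loss sum_t loss(a, t) * post(t)."""
    def expected_loss(a):
        return sum(loss(a, t) * p for t, p in zip(grid, post))
    return min(grid, key=expected_loss)


# Hypothetical discretised posterior, proportional to theta^2 (1 - theta)^5.
grid = [i / 100 for i in range(101)]
weights = [t**2 * (1 - t) ** 5 for t in grid]
total = sum(weights)
post = [w / total for w in weights]

est_quadratic = bayes_estimate(grid, post, lambda a, t: (a - t) ** 2)
est_absolute = bayes_estimate(grid, post, lambda a, t: abs(a - t))
est_zero_one = bayes_estimate(grid, post, lambda a, t: 0.0 if a == t else 1.0)

# Direct summaries of the same discretised posterior, for comparison.
post_mean = sum(t * p for t, p in zip(grid, post))
cumulative, post_median = 0.0, None
for t, p in zip(grid, post):
    cumulative += p
    if cumulative >= 0.5:
        post_median = t
        break
post_mode = grid[post.index(max(post))]

print("quadratic loss:", est_quadratic, "posterior mean:", post_mean)
print("absolute loss: ", est_absolute, "posterior median:", post_median)
print("zero-one loss: ", est_zero_one, "posterior mode:", post_mode)
```

Up to the resolution of the grid, quadratic loss recovers the posterior mean, absolute
loss the posterior median and zero-one loss the posterior mode.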


Classical decision theory ideas can obviously be applied to point estimation 
viewed as a decision problem. Thus, one may define admissible estimates, minimax 
estimates, etc., with respect to any particular loss function. From our perspective, 
the problems and limitations of classical decision theory that we identified in Sec- 
tion B.2.1 carry over to particular applications such as point estimation. Thus, 
admissible estimators are essentially Bayes estimators, but classical decision the- 
ory provides no foundationally justified procedure for choosing among admissible 
estimators, with—as we noted—the general minimax principle being unpalatable 
to most statisticians. 


The frequentist approach proceeds by defining possible desiderata of the long 
run behaviour of point estimators, and, using these desiderata as criteria, proposes 
methods for obtaining “best” estimators, and identifies conditions under which 
“good behaviour” will result. The criteria adopted are typically non-constructive. 


The likelihood approach proceeds by using the likelihood function to measure 
the strength with which the possible parameter values are supported by the data. 
Hence, the optimal estimator is naturally taken to be that $\theta$ which maximises the
likelihood function. It is worth stressing that this is a constructive criterion, in 
that the very definition of a maximum likelihood estimator (MLE) determines its 
method of construction. 


Fiducial, pivotal and structural inference approaches all produce “posterior”
probability distributions for $\theta$. Hence, their “solution” to the problem of point
estimation is essentially that suggested by the Bayesian approach; either to offer
as an estimator of $\theta$ some location measure of the probability distribution of $\theta$
or, more formally, to obtain that value of $\theta$ which minimises some specified loss
function with respect to such a distribution.


Criteria for Point Estimation 


It should be clear from Sections 5.1.4 and B.2.2 that the search for good estimators 
may safely be limited to those based on sufficient statistics, for then, and only then, 
is one certain to use all the relevant information about the parameter of interest. 
However, the following two points introduce a note of caution. 

(i) Sufficiency is a global concept; thus, if $\hat{\theta}(x)$ is sufficient for $\theta$, it does
not follow that $\hat{\theta}_i(x)$ is sufficient for a component parameter $\theta_i$, even if
$\hat{\theta}_i(x)$ is sufficient for $\theta_i$ when $\theta - \{\theta_i\}$ is known. For instance, with
univariate normal data $(\bar{x}, s^2)$ is jointly sufficient for $(\mu, \sigma^2)$, but $\bar{x}$ is not
sufficient for $\mu$, nor is $s^2$ sufficient for $\sigma^2$.

(ii) Sufficiency is a concept relative to a model; thus, even a small perturbation
to the assumed model may destroy sufficiency. For example, $(\bar{x}, s^2)$ is not sufficient
for $(\mu, \sigma)$ if the true model is $\mathrm{St}(x \mid \mu, \sigma, 1000)$ or the mixture form
$0.999 \times N(x \mid \mu, \sigma) + 0.001 \times N(x \mid 0, 1)$, even though these two models are indeed
very “close” to $N(x \mid \mu, \sigma)$.

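The bias of an estimator $\hat{\theta}(x)$ is defined to be

$$b(\theta) = \int \hat{\theta}(x) \, p(x \mid \theta) \, dx - \theta$$

and its mean squared error (mse) to be

$$\mathrm{mse}(\hat{\theta} \mid \theta) = \int \{\hat{\theta}(x) - \theta\}^2 \, p(x \mid \theta) \, dx = V(\hat{\theta} \mid \theta) + \{b(\theta)\}^2.$$

From a frequentist point of view it is desired that, in the long run, $\hat{\theta}$ should be as
close to $\theta$ as possible; thus, if quadratic loss is judged to be an appropriate “distance”
measure, a frequentist would like an estimator $\hat{\theta}$ with small $\mathrm{mse}(\hat{\theta} \mid \theta)$ for almost
all values of $\theta$. A concept of relative efficiency is developed in these terms. An
estimator $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$ if, for all $\theta$,
$\mathrm{mse}(\hat{\theta}_1 \mid \theta) < \mathrm{mse}(\hat{\theta}_2 \mid \theta)$.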
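
The decomposition of the mean squared error into variance plus squared bias, and the
strength of the requirement "for all parameter values" in the definition of relative
efficiency, can be checked exactly for a binomial model. This is a minimal sketch under
stated assumptions: the sample size 10, the two parameter values and the second estimator
(the posterior mean under a uniform prior) are illustrative choices, not taken from the text.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(r, n, theta):
    """Probability of r successes in n trials with success probability theta."""
    return comb(n, r) * theta**r * (1 - theta) ** (n - r)


def bias_var_mse(estimator, n, theta):
    """Exact sampling bias, variance and mse of estimator(r, n) under the binomial model."""
    probs = [binomial_pmf(r, n, theta) for r in range(n + 1)]
    values = [estimator(r, n) for r in range(n + 1)]
    mean = sum(p * v for p, v in zip(probs, values))
    variance = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    mse = sum(p * (v - theta) ** 2 for p, v in zip(probs, values))
    return mean - theta, variance, mse


def unbiased(r, n):
    """The usual unbiased estimator r / n."""
    return Fraction(r, n)


def shrunk(r, n):
    """Posterior mean under a uniform prior, (r + 1) / (n + 2); an illustrative choice."""
    return Fraction(r + 1, n + 2)


n = 10
for theta in (Fraction(1, 10), Fraction(1, 2)):
    for name, estimator in (("r/n", unbiased), ("(r+1)/(n+2)", shrunk)):
        bias, variance, mse = bias_var_mse(estimator, n, theta)
        assert mse == variance + bias**2  # mse = variance + bias^2, exactly
        print(theta, name, bias, variance, mse)
```

With these choices the biased estimator has the smaller mean squared error at 1/2
(5/288 against 1/40) and the larger one at 1/10 (77/7200 against 9/1000), so neither
estimator is more efficient than the other in the sense just defined.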

A simple theory is available if attention is restricted to unbiased estimators,
i.e., estimators such that $b(\theta) = 0$, since then we simply have to minimise $V(\hat{\theta} \mid \theta)$
in this unbiased class. However, although requiring the sampling distribution of $\hat{\theta}$
to be centred at $\theta$ may have some intuitive appeal, there are powerful arguments
against requiring unbiasedness. Indeed:


(i) In many problems, there are no unbiased estimators. For instance, $r/n$ is an
unbiased estimator of the parameter $\theta$ for a binomial $\mathrm{Bi}(r \mid \theta, n)$ distribution,
but there is no unbiased estimator of $\theta^{1/2}$.

(ii) Even when they exist, unbiased estimators may give nonsensical answers, and
no theory exists which specifies conditions under which this can be guaranteed
not to happen. For example, the (unique) unbiased estimator of the parameter
$\theta \in (0, 1)$ of a geometric distribution $p(x \mid \theta) = \theta (1 - \theta)^x$, $x = 0, 1, \ldots$,
is $\hat{\theta}(0) = 1$, $\hat{\theta}(x) = 0$, $x = 1, 2, \ldots$; hardly a sensible solution! Similarly
(see Ferguson, 1967), if $\theta$ is the mean of a Poisson distribution,
$\mathrm{Pn}(x \mid \theta) = e^{-\theta} \theta^x / x!$, $x = 0, 1, \ldots$, then the only unbiased estimator of
$e^{-\theta}$, a quantity which must lie in $(0, 1)$, is 1 if $x = 0$ and 0 otherwise (again, hardly
sensible); but, even more ridiculously, the only unbiased estimate of $e^{-2\theta}$
is $(-1)^x$, leading to the estimate of a probability as $-1$ (for all odd $x$)!

(iii) The unbiasedness requirement violates the likelihood principle, by making the
answer dependent on the sampling mechanism. Thus, the unbiased estimator
of $\mu$ from a $N(x \mid \mu, \sigma)$ observation is $x$, but the unbiased estimator from

$$p(x \mid \mu, \sigma) = \begin{cases} N(x \mid \mu, \sigma), & \text{if } x < 100 \\ N(x \mid 0, 1), & \text{otherwise} \end{cases}$$

will be something else. Yet, if one is measuring $\mu$ with an instrument which
only works for values $x < 100$ and obtains $x = 50$, i.e., a valid measurement,
it seems inappropriate to make our estimate of $\mu$ dependent on the fact that
we might have obtained an invalid measurement, but did not.

(iv) Even from a frequentist perspective, unbiased estimators may well be unap- 
pealing if they lead to large mean squared errors, so that an estimator with 
small bias and small variance may be preferred to one with zero bias but a 
large variance. 
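
The Poisson case in point (ii) is easy to verify numerically. The sketch below truncates the
Poisson series at 100 terms, which is an assumption adequate only for moderate values of the
mean; the function names are illustrative. It checks that the indicator of a zero count has
expectation $e^{-\theta}$ and that $(-1)^x$ has expectation $e^{-2\theta}$, even though every
value either estimator can take lies outside the open interval $(0, 1)$.

```python
import math


def poisson_pmf(x, theta):
    """Pn(x | theta) = exp(-theta) * theta^x / x!."""
    return math.exp(-theta) * theta**x / math.factorial(x)


def expectation(estimator, theta, terms=100):
    """Sampling expectation of estimator(x), truncating the Poisson series (assumption)."""
    return sum(estimator(x) * poisson_pmf(x, theta) for x in range(terms))


def indicator_zero(x):
    """Estimator taking the value 1 when no events are observed, 0 otherwise."""
    return 1.0 if x == 0 else 0.0


def alternating_sign(x):
    """Estimator (-1)^x: +1 for even counts, -1 for odd counts."""
    return (-1.0) ** x


for theta in (0.5, 1.0, 3.0):
    print(
        theta,
        expectation(indicator_zero, theta), math.exp(-theta),
        expectation(alternating_sign, theta), math.exp(-2 * theta),
    )
```

Both estimators are unbiased, yet neither can ever return a value that the estimated
quantity could actually take.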


For further discussion of the conflict between Bayes and unbiased estimators, 
see Bickel and Blackwell (1967). See, also, Wald (1939). 

Another frequentist criterion for judging an estimator concerns the asymptotic
behaviour of its sampling distribution. If we write $\hat{\theta}_n = \hat{\theta}(x_1, \ldots, x_n)$ to make
explicit the dependence of the estimator on the sample size, a frequentist would
clearly like $\hat{\theta}_n$ to converge to $\theta$ (in some sense) as $n$ increases. An estimator $\hat{\theta}_n$
is said to be weakly consistent if $\hat{\theta}_n \to \theta$ in probability, and strongly consistent if
$\hat{\theta}_n \to \theta$ with probability one. By Chebychev’s inequality, a sufficient condition
for the weak consistency of unbiased estimators is that $V(\hat{\theta}_n) \to 0$ as $n \to \infty$.
Obviously, a consistent estimator is asymptotically unbiased.

For discussion on the consistency of Bayes estimators, see, for example, 
Schwartz (1965), Freedman and Diaconis (1983), de la Horra (1986) and Diaconis 
and Freedman (1986a, 1986b). For the frequentist properties of Bayes estimators, 
see Diaconis and Freedman (1983). 


“Optimum” Estimators 


We have mentioned before that minimising the variance among unbiased estimators
is often suggested as a procedure for obtaining “good” estimators. Sometimes, this
procedure is even further restricted to linear functions; thus, provided $\mu = E(x \mid \theta)$
and $\sigma^2 = V(x \mid \theta)$ exist, $\bar{x}$ is said to be the best linear unbiased estimator (BLUE)
of $\mu$, in the sense that it has the smallest mse among all linear, unbiased estimators.

It is easy, however, to demonstrate, with appropriate examples, that this is a
rather restricted view of optimality, since non-linear estimators may be considerably
more efficient. An “absolute” standard by which unbiased estimators may be judged
is provided by the Cramér-Rao inequality. Let $\hat{g} = \hat{g}(x)$ be an unbiased estimator
of $g(\theta)$ and define the efficient score function $u(x \mid \theta)$ to be

$$u(x \mid \theta) = \frac{\partial}{\partial \theta} \log p(x \mid \theta).$$

Then, under suitable regularity conditions, $E_{x \mid \theta}[u(x \mid \theta)] = 0$ and

$$V(\hat{g} \mid \theta) \geq \frac{[dg(\theta)/d\theta]^2}{I(\theta)},$$

where

$$I(\theta) = E_{x \mid \theta}[u^2(x \mid \theta)] = -E_{x \mid \theta}\left[\frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta)\right],$$

with equality if, and only if,

$$u(x \mid \theta) = k(\theta) \{\hat{g}(x) - g(\theta)\},$$

where $k(\theta)$ does not depend on $x$, in which case $\hat{g}$ is said to be a minimum variance
bound (MVB) estimator of $g(\theta)$. It follows that a minimum variance bound
estimator must be sufficient, unbiased, and a linear function of the score function.

We have already stressed that limiting attention to unbiased estimators may
not be a good idea in the first place. Moreover, the range of situations where
“optimal” unbiased estimators, i.e., the MVB estimators, can be found is rather
limited. Indeed, if $\hat{\theta}$ is sufficient for $\theta$ there is a unique function $g(\theta)$ for which a
MVB exists, namely that described above. For example, if $x = \{x_1, \ldots, x_n\}$ is a
random sample from $N(x \mid 0, \sigma^2)$, $\sum x_i^2 / n$ is a MVB estimator for $\sigma^2$, but no MVB
estimator exists for $\sigma$!
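
The normal example can be worked through directly from the equality condition for the
Cramér-Rao bound. The following derivation is a sketch that treats the second argument of
the normal density as the variance and uses only standard facts about normal moments.

```latex
% Random sample x = {x_1, ..., x_n} from a normal model with mean 0 and variance \sigma^2.
\begin{align*}
\log p(x \mid \sigma^2)
  &= -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} x_i^2, \\
u(x \mid \sigma^2)
  &= \frac{\partial}{\partial \sigma^2} \log p(x \mid \sigma^2)
   = -\frac{n}{2 \sigma^2} + \frac{1}{2 \sigma^4} \sum_{i=1}^{n} x_i^2
   = \frac{n}{2 \sigma^4} \left\{ \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \sigma^2 \right\}, \\
I(\sigma^2)
  &= E\left[ u^2(x \mid \sigma^2) \right] = \frac{n}{2 \sigma^4},
  \qquad
  V\left( \frac{1}{n} \sum_{i=1}^{n} x_i^2 \,\Big|\, \sigma^2 \right)
   = \frac{2 \sigma^4}{n} = \frac{1}{I(\sigma^2)}.
\end{align*}
% Differentiating with respect to \sigma instead of \sigma^2:
\[
\frac{\partial}{\partial \sigma} \log p(x \mid \sigma)
  = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} x_i^2
  = \frac{n}{\sigma^3} \left\{ \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \sigma^2 \right\}.
\]
```

The score for $\sigma^2$ has the required form with $k(\sigma^2) = n / (2 \sigma^4)$, so the bound is
attained; the score for $\sigma$ is again linear in $\sum x_i^2 / n - \sigma^2$ and cannot be written as
$k(\sigma) \{\hat{g}(x) - \sigma\}$ for any statistic $\hat{g}(x)$.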

One might then ask whether it is at least possible to obtain an unbiased estimator
with a variance which is lower than that of any other estimator for each $\theta$,
even if it does not reach the Cramér-Rao lower bound. Under suitable regularity
conditions, the existence of such uniformly minimum variance (UMV) estimators
can indeed be established. Specifically, Rao (1945) and Blackwell (1947) independently
proved that if $\hat{\theta}(x)$ is an estimator of $\theta$ and $t = t(x)$ is a sufficient statistic
for $\theta$, then, given the value of the sufficient statistic $t$, the conditional expectation
of $\hat{\theta}(x)$,

$$\tilde{\theta}(t) = E[\hat{\theta} \mid t] = \int \hat{\theta}(x) \, p(x \mid t) \, dx$$

is an improved estimator of $\theta$, in the sense that, for every value of $\theta$,
$\mathrm{mse}(\tilde{\theta} \mid \theta) \leq \mathrm{mse}(\hat{\theta} \mid \theta)$, a result which can be generalised to
multidimensional problems.

A decision-theoretic consequence of the so-called Rao-Blackwell theorem is
that any estimator of $\theta$ which is not a function of the sufficient statistic $t$ must be
inadmissible. However, as a constructive procedure for obtaining estimators this
result is of limited value due to the fact that it is usually very difficult to calculate the
required conditional expectation.

If $\hat{\theta}(x)$ is unbiased, and $t = t(x)$ is a complete sufficient statistic,
then $\tilde{\theta}(t)$ is unbiased, and is the UMV estimator of $\theta$. For example, $r/n$
is the MVB estimator of the parameter $\theta$ of a binomial distribution $\mathrm{Bi}(r \mid \theta, n)$, but
there is no MVB estimator of $\theta^2$. However, the result may be used to show that
$r(r - 1)/[n(n - 1)]$ is a UMV estimator of $\theta^2$.
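
The binomial example can be reproduced by brute force for a small number of trials. The
sketch below starts from a crude unbiased estimator of the squared success probability, the
product of the first two Bernoulli outcomes (a choice made here for illustration), and
averages it over all outcome sequences sharing the same total; given the total, such
sequences are equally likely.

```python
from fractions import Fraction
from itertools import product
from math import comb


def rao_blackwell(n):
    """E[x1 * x2 | r] for n Bernoulli trials, averaging over sequences with total r."""
    sums = {r: Fraction(0) for r in range(n + 1)}
    counts = {r: 0 for r in range(n + 1)}
    for xs in product((0, 1), repeat=n):
        r = sum(xs)
        sums[r] += xs[0] * xs[1]  # crude estimator: product of the first two outcomes
        counts[r] += 1
    return {r: sums[r] / counts[r] for r in range(n + 1)}


def sampling_expectation(table, n, theta):
    """Expectation of the estimator table[r] under the binomial model for r."""
    return sum(
        comb(n, r) * theta**r * (1 - theta) ** (n - r) * table[r]
        for r in range(n + 1)
    )


n = 6  # small illustrative number of trials (assumption)
improved = rao_blackwell(n)
for r in range(n + 1):
    print(r, improved[r], Fraction(r * (r - 1), n * (n - 1)))

theta = Fraction(1, 3)
print(sampling_expectation(improved, n, theta), theta**2)
```

The conditional expectation coincides with $r(r - 1)/[n(n - 1)]$ for every total, and its
sampling expectation is exactly the square of the success probability.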

Maximum likelihood estimators are not guaranteed to exist or to be unique; but when they
do exist they typically have very good asymptotic properties. Under fairly general
conditions, MLEs can be shown to be consistent (hence asymptotically unbiased,
even if biased in small samples), asymptotically fully efficient, and asymptotically
normal, so that, as $n \to \infty$, the sampling distribution of $\hat{\theta}$ converges to the normal
distribution $N(\hat{\theta} \mid \theta, I(\theta))$ with mean $\theta$ and precision $I(\theta)$, the information function.

Bayesian estimators always exist for appropriately chosen loss functions and 
automatically use all the relevant information in the data. They are typically biased, 
and have analogous asymptotic properties to maximum likelihood estimators (i.e., 
from a frequentist perspective they are consistent, asymptotically fully efficient and, 
under suitable regularity conditions, asymptotically normal). A famous example is
the Pitman estimator (Pitman, 1939), which may be obtained as the posterior mean
which corresponds to a uniform prior; see, also, Robert et al. (1993).

Both the likelihood and the Bayesian solutions to the point estimation problem
automatically define procedures for obtaining them; the frequentist approach
does not (except for special cases like the exponential family). In addition to the
MLE approach, other methods of construction include minimum chi-squared, least 
squares, and the method of moments. However, these methods do not in themselves 
guarantee any particular properties for the resulting estimators, which usually have 
to be investigated case by case. Historically, all these construction methods have 
been used at various times within the frequentist approach to produce candidate 
“good estimators”, which have then been analysed using the criteria described 
above. Nowadays, partly under the influence of classical decision theory, some 
frequentist statisticians pragmatically minimise an expected posterior loss to ob- 
tain an estimator, whose behaviour they then proceed to study using non-Bayesian 
criteria. 

For an extensive treatment of the topic of point estimation, see Lehmann 
(1959/1983). 


B.3.2 Interval Estimation 


Let $\{p(x \mid \theta), \theta \in \Theta\}$ be a fully specified parametric family of models and suppose
that it is desired to calculate, from the data $x$, a region $C(x)$ within which the
parameter $\theta$ may reasonably be expected to lie. Thus, rather than mapping $X$
into $\Theta$, as in point estimation, a subset of $\Theta$ is associated with each value of $x$,
whose elements may be claimed to be supported by the data as “likely” values of
the unknown parameter $\theta$. This is the so-called region estimation problem; when
$\theta$ is one-dimensional, the regions obtained are typically intervals, hence the more
standard reference to the interval estimation problem.

Region estimates of $\theta$ may be motivated pragmatically as informative simple
summaries of the inferences to be drawn from $x$ about the value of $\theta$ or, more
formally, as a set of $\theta$ values which may safely be declared to be consistent with
the observed data.

We recall from Section 5.1.5 that, within a Bayesian framework, credible
regions provide a sensible solution to the problem of region estimation. Indeed, for
each $\alpha$ value, $0 < \alpha < 1$, a $100(1 - \alpha)\%$ credible region $C$, i.e. such that

$$\int_C p(\theta \mid x) \, d\theta = 1 - \alpha,$$

contains the true value of the parameter with (posterior) probability $1 - \alpha$ and,
among such regions, those of the smallest size, i.e., the highest posterior density
(HPD) regions, suggest themselves as summaries of the inferential content of the
posterior distribution. Note that this formulation is equally applicable to prediction
problems simply by using the corresponding posterior predictive distribution.
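
On a discretised posterior, a highest posterior density region can be built by adding grid
points in decreasing order of posterior probability until the required content is reached.
The sketch below is a grid approximation under stated assumptions: the posterior
proportional to $\theta^7 (1 - \theta)^3$, the grid spacing and the function name are
illustrative choices only.

```python
def hpd_region(grid, post, alpha):
    """Smallest set of grid points whose total posterior probability is at least 1 - alpha."""
    order = sorted(range(len(grid)), key=lambda i: post[i], reverse=True)
    region, mass = [], 0.0
    for i in order:
        region.append(grid[i])
        mass += post[i]
        if mass >= 1 - alpha:
            break
    return sorted(region), mass


# Hypothetical discretised posterior, proportional to theta^7 (1 - theta)^3.
grid = [i / 200 for i in range(201)]
weights = [t**7 * (1 - t) ** 3 for t in grid]
total = sum(weights)
post = [w / total for w in weights]

region, mass = hpd_region(grid, post, alpha=0.05)
print("approximate 95% HPD interval:", region[0], region[-1])
print("posterior content:", mass)
```

Because points enter in order of decreasing density, every point inside the region has
posterior density at least as high as any point outside it, which is the defining property
of an HPD region; for a unimodal posterior the result is an interval around the mode.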


Confidence Limits 


For $0 < \alpha < 1$ and scalar $\theta \in \Theta \subseteq \mathbb{R}$, a statistic $\theta^{\alpha}(x)$ such that for all $\theta$,

$$\Pr\{\theta^{\alpha}(x) > \theta \mid \theta\} = 1 - \alpha,$$

and such that if $\alpha_1 > \alpha_2$ then $\theta^{\alpha_1} < \theta^{\alpha_2}$, is called an upper confidence limit for $\theta$
with confidence coefficient $1 - \alpha$. Note that if $g$ is strictly increasing, then $g(\theta^{\alpha})$
is an upper confidence limit for $g(\theta)$. The nesting condition is important to avoid
inconsistency; see e.g., Plante (1984, 1991).

Given $x$, the specific interval $(-\infty, \theta^{\alpha}(x)]$ is then typically interpreted as
a region where, given $x$, the parameter $\theta$ may reasonably be expected to lie. It
is crucial however to recognise that the only proper probability interpretation of a
confidence interval is that, in the long run, a proportion $1 - \alpha$ of the $\theta^{\alpha}(x)$ values
will be larger than $\theta$. Whether or not the particular $\theta^{\alpha}(x)$ which corresponds to
the observed data $x$ is smaller or greater than $\theta$ is entirely uncertain. One only has
the rather dubious “transferred assurance” from the long-run definition.

A lower confidence limit $\theta_{\alpha}(x)$ is similarly defined as a statistic $\theta_{\alpha}(x)$ such that
$\Pr\{\theta_{\alpha}(x) < \theta \mid \theta\} = 1 - \alpha$ with the corresponding nesting property. Combining
a lower limit at confidence level $1 - \alpha_1$, with an upper limit at confidence level
$1 - \alpha_2$, we obtain a two-sided confidence interval $[\theta_{\alpha_1}(x), \theta^{\alpha_2}(x)]$ at confidence
level $1 - \alpha_1 - \alpha_2$, such that, for all $\theta$,

$$\Pr\{\theta_{\alpha_1}(x) < \theta < \theta^{\alpha_2}(x) \mid \theta\} = 1 - \alpha_1 - \alpha_2.$$

For two-sided confidence intervals, a convenient choice is $\alpha_1 = \alpha_2$, which
produces central confidence intervals based on equal tail-area probabilities. There
are, however, other alternatives.


(i) Shortest confidence intervals. For fixed $\alpha_1 + \alpha_2 = \alpha$, $\alpha_1$ and $\alpha_2$ may be
chosen to minimise the expected interval length $E_{x \mid \theta}[\theta^{\alpha_2}(x) - \theta_{\alpha_1}(x) \mid \theta]$.
It must be realised, however, that shortest intervals for $\theta$ do not generally
transform to shortest intervals for functions $g(\theta)$. It can be proved that intervals
based on the score function

$$u(x \mid \theta) = \frac{\partial}{\partial \theta} \log p(x \mid \theta)$$

have asymptotically minimum expected length; moreover, the fact that $u$ has
a sampling distribution which is asymptotically normal $N(u \mid 0, I(\theta)^{-1})$, with
mean 0 and variance $I(\theta)$ (that is, precision $I(\theta)^{-1}$), may be used to provide
approximate confidence intervals for $\theta$.

(ii) Most selective intervals. For fixed $\alpha_1 + \alpha_2 = \alpha$, one could try to choose $\alpha_1$
and $\alpha_2$ to minimise the probability that the interval contains false values of $\theta$.
However, such uniformly most accurate intervals are not guaranteed to exist.


It is worth noting that, for a variety of reasons, the construction of confidence
intervals is by no means immediate.

(i) They typically do not exist for arbitrary confidence levels when the model is
discrete.

(ii) There is no general constructive guidance on which particular statistic to
choose in constructing the interval.

(iii) There are serious difficulties in incorporating any known restrictions on the
parameter space, and no systematic procedure exists for incorporating such
knowledge in the construction of confidence intervals.

(iv) In multiparameter situations, the construction of simultaneous confidence
intervals is rather controversial. It is less than obvious whether one should
use the confidence limits associated with individual intervals, or whether one
should think of the problem as that of estimating a region for a single vector
parameter, or as one of considering the probability that a number of confidence
statements are simultaneously correct.

(v) Interval estimation in the presence of nuisance parameters is another
controversial topic. Unless appropriate pivotal quantities can be found, the properties
of various alternative procedures, typically based on replacing the unknown
nuisance parameters by estimates, are generally less than clear.

(vi) Interval estimation of future observations poses yet another set of difficulties.
Unless one is able to find a function of the present and future observation
whose sampling distribution does not depend on the parameters (and this is
not typically the case), one is again limited to ad hoc approximations based
on substituting estimates for parameters.


But, even in the simplest case where $\theta$ is a scalar parameter labelling a continuous
model $p(x \mid \theta)$, the concept of a confidence interval is open to what many
would regard as a rather devastating criticism: namely, the fact that the confidence
limits can turn out to be either vacuous or just plain silly in the light of the observed
data. We give two examples.


(i) In the Fieller-Creasy problem, where the parameter of interest is the ratio of
two normal means, there are values $\alpha < 1$ such that, for a subset of possible
data with positive probability, the corresponding $1 - \alpha$ confidence interval is
the entire real line. Solemnly quoting the whole real line as a 95% confidence
interval for a real parameter is not a good advertisement for statistics. For
Bayesian solutions, see Bernardo (1977) and Raftery and Schweder (1993).

(ii) If $x_1$ and $x_2$ are two random observations from a uniform distribution on the
interval $(\theta - 0.5, \theta + 0.5)$, and $y_1$ and $y_2$ are, respectively, the smaller
and the larger of these two observations, then it is easily established that for all $\theta$

$$\Pr\{y_1 < \theta < y_2 \mid \theta\} = 0.5,$$

so that $(y_1, y_2)$ provides a 50% confidence interval. However, if for the
observed data it turns out that $y_2 - y_1 > 0.5$ then certainly $y_1 < \theta < y_2$, so
that we know for sure that $\theta$ belongs to the interval $(y_1, y_2)$, even though the
confidence level of the interval is only 50%.
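
The second example is easily checked by simulation. The following is a minimal
sketch, not part of the original argument: the value theta = 3, the number of
replications and the seed are arbitrary choices made purely for illustration.

```python
import random

def coverage_uniform_pair(theta=3.0, n_rep=100_000, seed=0):
    """Simulate the interval (y1, y2) from two U(theta - 0.5, theta + 0.5) draws.

    Returns (overall coverage, coverage among samples with y2 - y1 > 0.5).
    """
    rng = random.Random(seed)
    hits = wide = wide_hits = 0
    for _ in range(n_rep):
        x1 = rng.uniform(theta - 0.5, theta + 0.5)
        x2 = rng.uniform(theta - 0.5, theta + 0.5)
        y1, y2 = min(x1, x2), max(x1, x2)
        covered = y1 < theta < y2
        hits += covered
        if y2 - y1 > 0.5:          # the "relevant subset" of the sample space
            wide += 1
            wide_hits += covered
    return hits / n_rep, wide_hits / wide

overall, conditional = coverage_uniform_pair()
print(f"overall coverage:           {overall:.3f}")
print(f"coverage given y2-y1 > 0.5: {conditional:.3f}")
```

The overall frequency should be close to the nominal 0.5, while every interval
in the subset with y2 - y1 > 0.5 covers theta, which is precisely the information
that the unconditional 50% statement ignores.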


These examples reflect an inherent difficulty of the frequentist approach to
statistics: its inability to condition on the complete observed data.
Conditioning on ancillary statistics, when possible, may mitigate this problem, but it
certainly does not solve it and, as discussed in Section B.2.2, it may create
others. The reader interested in other blatant counterexamples to the (unconditional)
frequentist approach to statistics will find references in the literature under the
keywords "relevant subsets", which refer to subsets of the sample space yielding special
information and subverting the “long-run” or “on average” frequentist viewpoint.
Two important such references are Robinson (1975) and Jaynes (1976). See, also,
Buehler (1959), Basu (1964, 1988), Cornfield (1969), Pierce (1973), Robinson
(1979a, 1979b), Casella (1987, 1992), Maatta and Casella (1990) and Goutis and
Casella (1991).

As a final point, we should mention that for many of the standard textbook
examples of confidence intervals (typically those which can be derived from
univariate continuous pivotal quantities), the quoted intervals are numerically equal
to credible regions of the same level obtained from the corresponding reference
posterior distributions. This means that, in these cases, the intuitive interpretation
that many users (incorrectly, of course!) tend to give to frequentist intervals of
confidence $1 - \alpha$, namely that, given the data, there is probability $1 - \alpha$ that the
interval contains the true parameter value, would in fact be correct, if described,
instead, as a reference posterior credible interval.

A typical example of this situation is provided by the class of intervals

$$\bar{x} - t_{\alpha/2}\, s/\sqrt{n-1} < \mu < \bar{x} + t_{\alpha/2}\, s/\sqrt{n-1}$$

for the mean $\mu$ of a normal distribution with unknown precision. These are both the
“best” confidence intervals for $\mu$, derivable from the sampling distribution of the
pivotal quantity $\sqrt{n-1}\,(\bar{x} - \mu)/s$, and also the credible intervals which correspond
to the reference posterior distribution for $\mu$,
$\pi(\mu \mid x) = \mathrm{St}(\mu \mid \bar{x}, (n-1)s^{-2}, n-1)$,
derived in Example 5.17. Buehler and Feddersen (1963) demonstrated that relevant
subsets exist even in this standard case. Indeed, if $x = \{x_1, x_2\}$, then
$C = (x_{\min}, x_{\max})$ is a 50% interval for $\mu$, but if both observations belong to the set

$$R = \{(x_1, x_2) : |x_1 - x_2| > 4|\bar{x}|/3\}$$

then $\Pr\{\mu \in C \mid x \in R, \mu, \sigma\} \geq 0.5181$. Pierce (1973) has shown that similar
situations can occur whenever the confidence interval cannot be interpreted as a
credible region corresponding to a posterior distribution with respect to a proper
prior. Note that although this long-term coverage probability is not directly
relevant to a Bayesian, the example suggests that special care should be exercised when
interpreting posterior distributions obtained from improper priors.
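
The Buehler and Feddersen phenomenon can be explored numerically. The sketch
below is only illustrative: the values of mu, the choice sigma = 1, the number of
replications and the seed are assumptions of ours, and the simulation estimates
conditional frequencies rather than proving any bound.

```python
import random

def pair_interval_coverage(mu, sigma=1.0, n_rep=200_000, seed=1):
    """Coverage of C = (x_min, x_max) for two N(mu, sigma^2) observations.

    Returns (overall coverage, coverage restricted to the subset R in which
    |x1 - x2| > 4|xbar|/3).
    """
    rng = random.Random(seed)
    hits = in_r = in_r_hits = 0
    for _ in range(n_rep):
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        covered = min(x1, x2) < mu < max(x1, x2)
        hits += covered
        if abs(x1 - x2) > 4 * abs((x1 + x2) / 2) / 3:
            in_r += 1
            in_r_hits += covered
    return hits / n_rep, in_r_hits / in_r

for mu in (0.0, 1.0, 2.0):
    overall, given_r = pair_interval_coverage(mu)
    print(f"mu={mu}: overall {overall:.3f}, given R {given_r:.3f}")
```

In each case the overall frequency should be close to 0.5, whereas the frequency
within R should be visibly larger, so that the nominal 50% understates the
coverage once it is known that the data fall in R.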

Casella et al. (1993) have proposed, for interval estimation, alternative loss 
functions to the standard linear functions of volume and coverage probability. 


B.3.3 Hypothesis Testing 


Let $\{p(x \mid \theta), \theta \in \Theta\}$ be a fully specified parametric family of models, with
$\Theta$ partitioned into two disjoint subsets $\Theta_0$ and $\Theta_1$, and suppose that we wish to
decide whether the unknown $\theta$ lies in $\Theta_0$ or in $\Theta_1$. If $H_0$ denotes the hypothesis that
$\theta \in \Theta_0$ and $H_1$ the hypothesis that $\theta \in \Theta_1$, we have a decision problem, with only
two possible answers to the inference problem, $a_0$ = accept $H_0$ or $a_1$ = accept $H_1$,
where the choice is to be made on the basis of the observed data $x$. This is the
so-called problem of hypothesis testing. In most such problems, the two hypotheses
are not symmetrically treated; the working hypothesis $H_0$ is usually called the null
hypothesis, while $H_1$ is referred to as the alternative hypothesis. Although the
theory can easily be extended to any finite number of alternative hypotheses, we
will present our discussion in terms of a single alternative hypothesis.
We recall from Section 6.1 that, within a Bayesian framework, the problem of
hypothesis testing, as formulated above, can be appropriately treated using standard
decision theoretical methodology; that is, by specifying a prior distribution and an
appropriate utility function, and maximising the corresponding posterior expected
utility. We also recall that the solution to the decision problem posed generally
depends on whether or not the “true” model is assumed to be included in the family
of analysed models. Assuming the stylised M-closed case, where the true model
is assumed to belong to the family $\{p(x \mid \theta), \theta \in \Theta\}$ and the utility structure is
simply

$$u(a_i, \theta) = \begin{cases} 0, & \theta \in \Theta_i, \quad i = 0, 1, \\ -l_{ij}, & \theta \in \Theta_j, \quad j \neq i, \end{cases}$$

we have seen (Proposition 6.1) that the null hypothesis $H_0$ should be rejected if,
and only if,

$$B_{01}(x) < \frac{l_{01}\, p(H_1)}{l_{10}\, p(H_0)}.$$

This corresponds to checking whether the appropriate (integrated) likelihood ratio,
or Bayes factor,

$$B_{01}(x) = \frac{\int_{\Theta_0} p(x \mid \theta)\, p(\theta)\, d\theta \big/ \int_{\Theta_0} p(\theta)\, d\theta}{\int_{\Theta_1} p(x \mid \theta)\, p(\theta)\, d\theta \big/ \int_{\Theta_1} p(\theta)\, d\theta},$$

is smaller than a cut-off point which depends on the ratio $l_{01}/l_{10}$ of the losses
incurred, respectively, by accepting a false null and rejecting a true null, and on the
ratio of the prior probabilities of the hypotheses,

$$p(H_i) = \int_{\Theta_i} p(\theta)\, d\theta, \quad i = 0, 1.$$
From the point of view of classical decision theory, the problem of hypothesis
testing is naturally posed in terms of decision rules. Thus, a decision rule for this
problem (henceforth called a test procedure $\delta$, or simply a test $\delta$) is specified in
terms of a critical region $R_\delta$, defined as the set of $x$ values such that $H_0$ is rejected
whenever $x \in R_\delta$. The most relevant frequentist aspect of such a procedure $\delta$ is
its power function

$$\mathrm{pow}(\theta \mid \delta) = \Pr\{x \in R_\delta \mid \theta\},$$

which specifies, as a function of $\theta$, the long-run probability that the test rejects the
null hypothesis $H_0$. Obviously, the ideal power function would be

$$\mathrm{pow}(\theta \mid \delta) = \begin{cases} 0, & \theta \in \Theta_0, \\ 1, & \theta \in \Theta_1, \end{cases}$$

although, naturally, one will seldom be able to derive a test procedure with such
an ideal power function. For any $\theta \in \Theta_0$, $\mathrm{pow}(\theta \mid \delta)$ is the long-run probability of
incorrect rejection of the null hypothesis; frequentist statisticians often specify an
upper bound for such probability, which is then called the level of significance of
the tests to be considered. The size of any specific test $\delta$ is defined to be

$$\alpha = \sup_{\theta \in \Theta_0} \mathrm{pow}(\theta \mid \delta);$$

thus, to specify a significance level $\alpha$ is to restrict attention to those tests whose
size is not larger than $\alpha$.

Either $\Theta_0$ or $\Theta_1$ may contain just a single value of $\theta$. In this case, the
corresponding hypothesis is referred to as a simple hypothesis; if $\Theta_i$ contains more than
one value of $\theta$, then $H_i$ is referred to as a composite hypothesis.

For any test procedure $\delta$ one may explicitly consider two types of error:
rejecting a true null hypothesis, a so-called error of type 1, and accepting a false null
hypothesis, a so-called error of type 2. Let us denote by $\alpha(\delta \mid \theta)$ and $\beta(\delta \mid \theta)$ the
respective probabilities of these two types of error,

$$\alpha(\delta \mid \theta) = \begin{cases} \Pr\{x \in R_\delta \mid \theta\}, & \theta \in \Theta_0, \\ 0, & \text{otherwise}, \end{cases}$$

$$\beta(\delta \mid \theta) = \begin{cases} \Pr\{x \notin R_\delta \mid \theta\}, & \theta \in \Theta_1, \\ 0, & \text{otherwise}. \end{cases}$$

It would obviously be desirable to identify tests which keep both error
probabilities as small as possible. However, typically, modifying $\delta$ to reduce one would
make the other larger. Hence, one usually tries to minimise some function of the
two; for example, a linear combination $a\,\alpha(\delta \mid \theta) + b\,\beta(\delta \mid \theta)$.


Testing Simple Hypotheses 

When both $H_0$ and $H_1$ are simple hypotheses, so that $\alpha(\delta \mid \theta) = \alpha(\delta \mid \theta_0) = \alpha(\delta)$
and $\beta(\delta \mid \theta) = \beta(\delta \mid \theta_1) = \beta(\delta)$, it can be proved that a test which minimises
$a\,\alpha(\delta) + b\,\beta(\delta)$ should reject $H_0$ if, and only if,

$$\frac{p(x \mid \theta_0)}{p(x \mid \theta_1)} < \frac{b}{a},$$

i.e., if the likelihood ratio in favour of the null is smaller than the ratio of the
weights given to the two kinds of error. This can be seen as a particular case of
the Bayesian solution recalled above, and is closely related to the Neyman-Pearson
lemma (Neyman and Pearson, 1933, 1967) which says that a test which minimises
$\beta(\delta)$ subject to $\alpha(\delta) \leq \alpha$ must reject $H_0$ if, and only if,

$$\frac{p(x \mid \theta_0)}{p(x \mid \theta_1)} < k,$$

for some appropriately chosen constant $k$. It has become standard practice among
many frequentist statisticians to choose a significance level $\alpha_0$ (often “conventional”
quantities such as 0.05 or 0.01) and then to find a test procedure which minimises
$\beta(\delta)$ among all tests such that $\alpha(\delta) \leq \alpha_0$ (rather than explicitly minimising some
combination of the two probabilities of error). The Neyman-Pearson lemma shows
explicitly how to derive such a test, but it should be emphasised that this is not a
sensible procedure. Indeed:


(i) With discrete data one cannot attain a fixed specific size $\alpha(\delta)$ without recourse
to auxiliary, irrelevant randomisation, whereas minimisation of a linear
combination of the form $a\,\alpha(\delta) + b\,\beta(\delta)$ can always be achieved. For a Bayesian
view on randomisation, see Kadane and Seidenfeld (1986).

(ii) More importantly, by fixing $\alpha(\delta)$ and minimising $\beta(\delta)$ one may find that, with
large sample sizes, $H_0$ is rejected when $p(x \mid H_0)$ is far larger than $p(x \mid H_1)$,
due to the fact that the minimising $\beta(\delta)$ may be extremely small compared with
the fixed $\alpha(\delta)$. Although this can be avoided by carefully selecting $\alpha(\delta)$ as a
decreasing function of the sample size, it seems far more natural to minimise
a linear combination $a\,\alpha(\delta) + b\,\beta(\delta)$ of the two error probabilities, in which
case no difficulties of this type can arise.
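
Point (ii) can be made concrete with a small numerical sketch. The setting is
an assumption of ours chosen for simplicity: a random sample of size n from a
normal distribution with unit variance, with simple hypotheses mu = 0 (null)
and mu = 1 (alternative), for which log p(x | H0)/p(x | H1) = n(1 - 2 xbar)/2.
The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def fixed_alpha_test(n, alpha0=0.05):
    """Most powerful test of mu = 0 against mu = 1 with size alpha0.

    Returns the critical value c (reject H0 when xbar > c), the type 2
    error probability beta, and log p(x | H0)/p(x | H1) at xbar = c.
    """
    c = Z.inv_cdf(1 - alpha0) / sqrt(n)
    beta = Z.cdf(sqrt(n) * (c - 1.0))
    log_lr_at_c = n * (1 - 2 * c) / 2
    return c, beta, log_lr_at_c

def equal_weight_test(n):
    """Test minimising alpha + beta: reject H0 when the likelihood ratio < 1."""
    c = 0.5                       # xbar > 1/2 if, and only if, p(x|H0) < p(x|H1)
    alpha = 1 - Z.cdf(sqrt(n) * c)
    beta = Z.cdf(sqrt(n) * (c - 1.0))
    return c, alpha, beta

for n in (10, 30, 100):
    c, beta, log_lr = fixed_alpha_test(n)
    print(f"n={n}: reject if xbar > {c:.3f}, beta = {beta:.2e}, "
          f"log likelihood ratio at the boundary = {log_lr:.2f}")
```

For n = 100 the fixed-size test rejects the null at sample means for which the
log likelihood ratio in favour of the null exceeds 30, while beta is negligible
(it may print as zero in floating point). The equal-weight rule instead keeps
the two error probabilities equal, and both decrease together as n grows.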

Other strategies for the choice of $\alpha(\delta)$ and $\beta(\delta)$ have been proposed. For example,
in the 0-1 loss case, $\alpha(\delta) = \beta(\delta)$ corresponds to the minimax principle. However,
it is important to note (see, e.g., Lindley, 1972) that minimising a linear combination
of the two types of error is actually the only coherent way of making a choice, in
the sense that no other procedure is equivalent to minimising an expected loss.


Composite Alternative Hypotheses 

In spite of the difficulties described above, frequentist statisticians have traditionally
defined an optimal test $\delta$ to be one which minimises $\beta(\delta \mid \theta)$ for a fixed significance
level $\alpha_0$. In terms of the power function, this implies deriving a test $\delta$ such that

$$\mathrm{pow}(\theta \mid \delta) \leq \alpha_0, \quad \theta \in \Theta_0,$$

and for which $\mathrm{pow}(\theta \mid \delta)$ is as large as possible in $\Theta_1$. A test procedure $\delta^*$ is called
a uniformly most powerful (UMP) test, at level of significance $\alpha_0$, if $\alpha(\delta^* \mid \theta) \leq \alpha_0$
and, for any other $\delta$ such that $\alpha(\delta \mid \theta) \leq \alpha_0$,

$$\mathrm{pow}(\theta \mid \delta) \leq \mathrm{pow}(\theta \mid \delta^*), \quad \text{for all } \theta \in \Theta_1.$$

It can be proved that, when $\Theta$ is one-dimensional, UMP tests often exist for
one-sided alternative hypotheses.

A model $\{p(x \mid \theta), \theta \in \Theta \subset \Re\}$ is said to have a monotone likelihood ratio in
the statistic $t = t(x)$ if for all $\theta_1 < \theta_2$, $p(x \mid \theta_2)/p(x \mid \theta_1)$ is an increasing function
of $t$. If $p(x \mid \theta)$ has a monotone likelihood ratio in $t$ and $c$ is a constant such that

$$\Pr\{t \geq c \mid \theta_0\} = \alpha_0,$$

then the test $\delta$ which rejects $H_0$ if $t \geq c$ is a UMP test of the hypothesis $H_0 \equiv \theta \leq \theta_0$
versus the alternative $H_1 \equiv \theta > \theta_0$, at the level of significance $\alpha_0$. However, UMP
tests do not generally exist.


Example B.8. (Non-existence of a UMP test). If $x = \{x_1, \ldots, x_n\}$ is a random
sample from a normal distribution $N(x \mid \mu, 1)$, then the test $\delta_1$ defined by critical region
$R_{\delta_1} = \{x;\ \bar{x} - \mu_0 > 1.282/\sqrt{n}\}$ is a UMP test for $H_0 \equiv \mu \leq \mu_0$ versus $H_1 \equiv \mu > \mu_0$, with
0.10 significance level. Similarly, the test $\delta_2$ defined by $R_{\delta_2} = \{x;\ \mu_0 - \bar{x} > 1.282/\sqrt{n}\}$ is
a UMP test for $H_0 \equiv \mu \geq \mu_0$ versus $H_1 \equiv \mu < \mu_0$, with the same level. Since these critical
regions are different, it follows that there is no UMP test for $\mu = \mu_0$ versus $\mu \neq \mu_0$.


The fact, illustrated in the above example, that UMP tests typically do not
exist for two-sided alternatives, suggests that a less demanding criterion must be
used if one is to define a “best” test among those with a fixed significance level.
Since the power function $\mathrm{pow}(\theta \mid \delta)$ describes the probability that the test $\delta$ rejects
the null, it seems desirable that, when $H_0$ is true, $\mathrm{pow}(\theta \mid \delta)$ should be smaller in
$\Theta_0$ than elsewhere. A test $\delta$ is called unbiased if for any pair $\theta_0 \in \Theta_0$ and $\theta_1 \in \Theta_1$
it is true that $\mathrm{pow}(\theta_0 \mid \delta) \leq \mathrm{pow}(\theta_1 \mid \delta)$.


Example B.9. (Comparative power of different tests). If $x = \{x_1, \ldots, x_n\}$ is a
random sample from a normal distribution $N(x \mid \mu, 1)$ then the test $\delta_3$ defined by the region

$$R_{\delta_3} = \{x;\ |\bar{x} - \mu_0| > 1.645/\sqrt{n}\}$$

is an unbiased test for $H_0 \equiv \mu = \mu_0$ versus $H_1 \equiv \mu \neq \mu_0$. Figure B.2 compares the power
of this test with those defined in Example B.8, and with that of a typical non-symmetric test
$\delta_4$ of the same level, which has the critical region

$$R_{\delta_4} = \{x;\ \bar{x} - \mu_0 > c_1/\sqrt{n}, \text{ or } \mu_0 - \bar{x} > c_2/\sqrt{n}\}$$

for suitably chosen constants $c_1 < c_2$. It seems obvious that $\delta_4$, which is more cautious
about accepting values of $\mu$ larger than $\mu_0$ than about accepting values of $\mu$ smaller than $\mu_0$,
should be preferred to the unbiased test $\delta_3$ whenever the consequences of the first class of
errors are more serious, or whenever the values of $\mu$ smaller than $\mu_0$ are considered to be
more likely.

Figure B.2. Power of tests for the mean of a normal distribution
($\mu_0 = 1$, $n = 30$, $c_1 = 1.40$, $c_2 = 2.05$)
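
The power functions of the four tests in Examples B.8 and B.9 can be tabulated
directly from the normal distribution function, since the sample mean is normal
with mean mu and variance 1/n. The following is a minimal sketch; the constants
mu0 = 1, n = 30, c1 = 1.40 and c2 = 2.05 are taken from the caption of
Figure B.2 as far as it is legible, and the grid of mu values is arbitrary.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()          # standard normal distribution
MU0, N = 1.0, 30          # assumed values, read from the Figure B.2 caption

def power_upper(mu, c):
    """Pr{xbar - MU0 > c/sqrt(N) | mu}, with xbar ~ N(mu, 1/N)."""
    return 1 - Z.cdf(c - sqrt(N) * (mu - MU0))

def power_lower(mu, c):
    """Pr{MU0 - xbar > c/sqrt(N) | mu}."""
    return Z.cdf(-c - sqrt(N) * (mu - MU0))

tests = {
    "delta1": lambda mu: power_upper(mu, 1.282),                          # one-sided
    "delta2": lambda mu: power_lower(mu, 1.282),                          # one-sided
    "delta3": lambda mu: power_upper(mu, 1.645) + power_lower(mu, 1.645), # unbiased
    "delta4": lambda mu: power_upper(mu, 1.40) + power_lower(mu, 2.05),   # asymmetric
}

for mu in (0.6, 0.8, 1.0, 1.2, 1.4):
    row = ", ".join(f"{name}={pw(mu):.3f}" for name, pw in tests.items())
    print(f"mu={mu}: {row}")
```

All four tests have size close to 0.10 at mu = mu0. Each one-sided test beats
the two-sided tests on its own side of mu0 and is very poor on the other side,
which is the reason no single test can be uniformly most powerful against the
two-sided alternative.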


It is clear from Example B.9 above that, even when they exist, unbiased 
procedures may only be reasonable in special circumstances. We are drawn again 
to the general comment that, in any decision procedure, prior information and utility 
preferences should be an integral part of the solution. 

Yet another approach to defining a “good” test when UMP tests do not exist is
to focus attention on local power, by requiring the power function to be maximised
in a neighbourhood of the null hypothesis. Under suitable regularity conditions,
locally most powerful tests may be derived by using the sampling distribution of
the efficient score function in a process which is closely related to that described
in our discussion of interval estimation. However, the requirement of maximum
local power does not say anything about the behaviour of the test in a region of high
power and, indeed, locally most powerful tests may be very inappropriate when the
true value of $\theta$ is far from $\Theta_0$.


Methodological Discussion 


Testing hypotheses using the frequentist methodology described above may be 
misleading in many respects. In particular: 


(i) It should be obvious that the all too frequent practice of simply quoting whether
or not a null hypothesis is rejected at a specified significance level $\alpha_0$ ignores
a lot of relevant information. Clearly, if such a test is to be performed, the
statistician should report the cut-off point $\alpha$ such that $H_0$ would not be rejected
for any level of significance smaller than $\alpha$. This value is called the tail area
or p-value corresponding to the observed value of the statistic. An added
advantage of this approach is that there is no need to select beforehand an
arbitrary significance level. As noted in the case of confidence intervals, there
is a tendency on the part of many users to interpret a p-value as implying
that the probability that $H_0$ is true is smaller than the p-value. Not only,
of course, is this false within the frequentist framework but, in this case,
there is, in general, no simple form of reinterpretation which would have a
Bayesian justification, so that, even numerically, p-values cannot generally
be interpreted as posterior probabilities. For detailed discussions see, for
example, Berger (1985a), Berger and Delampady (1987) and Berger and Sellke
(1987). See Casella and Berger (1987) for an attempted reconciliation in the
case of one-sided tests.

(ii) Another statistical “tradition” related to hypothesis testing consists of
declaring an observed value statistically significant, implying that there exists
statistical evidence which is sufficient to reject the null hypothesis, whenever the
corresponding tail area is smaller than a “conventional” value such as 0.05 or
0.01. However, since the classical theory of hypothesis testing does not make
any use of a utility function, there is no way to assess formally whether or not
the true value of the parameter $\theta$, which may well be numerically different
from a hypothetical value $\theta_0$, is significantly different from $\theta_0$ in the sense
of implying any practical difference. Thus, a vote proportion of 34% for a
political party is technically different from a proportion of 34.001%, but under
most plausible utility functions the difference has no political significance.

(iii) Finally, the mutual inconsistency of frequentist desiderata often makes it
impossible, even in the theory’s own terms, to identify the most appropriate
procedure. For example, if $x$ is a random sample from $N(x \mid \mu, m\lambda)$ with
precision $m\lambda$ determined by a random integer $m$, then $m$ is ancillary and hence,
by the conditionality principle, tests on $\mu$ or $\lambda$ should condition on the observed
$m$. Yet, Durbin (1969) showed that, at least asymptotically, unrestricted tests
may be uniformly more powerful.


See Chernoff (1951) and Stein (1951) for further arguments against standard 
hypothesis testing. 


B.3.4 Significance Testing 


In the previous section, we have reviewed the problem of hypothesis testing where,
given a family $\{p(x \mid \theta), \theta \in \Theta\}$, a null hypothesis $H_0 \equiv \theta \in \Theta_0$ is tested against
(at least) one specific alternative. In this section we shall review the problem of
pure significance tests, where only the null hypothesis $H_0 \equiv \{p(x \mid \theta), \theta \in \Theta_0\}$
has been initially proposed, and it is desired to test whether or not the data $x$ are
compatible with this hypothesis, without considering specific alternatives. The null
hypothesis may be either simple, if it completely specifies a density $p(x \mid \theta_0)$, or
composite.

We recall from Section 6.2 that, within the Bayesian framework, the problem
of significance testing, as formulated above, could be solved by embedding the
hypothetical model in some larger class $\{p(x \mid \theta), \theta \in \Theta\}$, designed either to
contain actual alternatives of practical interest, or formal alternatives generated by
selecting a mathematical neighbourhood of $H_0$. For any discrepancy measure

$$\delta(\theta) = u\{\bar{H}_0, \theta\} - u\{H_0, \theta\},$$

describing, for each $\theta$, the conditional utility difference, and for any function $\varepsilon_0(x)$,
describing the additional utility obtained by retaining $H_0$ because of its special
status, we showed that $H_0$ should be rejected if

$$t(x) > \varepsilon_0(x),$$

where

$$t(x) = \int \delta(\theta)\, p(\theta \mid x)\, d\theta$$

is the expected posterior discrepancy. In particular, we proposed the logarithmic
discrepancy

$$\delta(\theta) = \int p(x \mid \theta) \log \frac{p(x \mid \theta)}{p(x \mid \theta_0)}\, dx$$

as a reasonable general discrepancy measure. This (fully Bayesian) procedure
could be described as that of selecting a test statistic $t(x)$ which is expected to
measure the discrepancy between $H_0$ and the true model, and rejecting $H_0$ if $t(x)$
is larger than some cut-off point $\varepsilon_0(x)$ describing the additional utility of keeping
$H_0$ if it were true, due to its special status corresponding to simplicity (Occam's
razor), scientific support (or fashion), or whatever.

From a frequentist point of view, a test statistic $t = t(x)$ is selected with two
requirements in mind.

(i) The sampling distribution of $t$ under the null hypothesis, $p(t \mid H_0)$, must be
known and, if $H_0$ is composite, $p(t \mid H_0)$ should be the same for all $\theta \in \Theta_0$.

(ii) The larger the value of $t$, the stronger the evidence of the departure from $H_0$
of the kind which it is desired to test.

Then, given the data $x$, a p-value or significance level is calculated as the
probability, conditional on $H_0$, that, in repeated samples, $t$ would exceed the observed
value $t(x)$, so that $p$ is given by

$$p = \int_{t(x)}^{\infty} p(t \mid H_0)\, dt.$$

Small values of $p$ are regarded as strong evidence that $H_0$ should be rejected.
The result of the analysis is typically reported by stating the p-value and declaring
that $H_0$ should be rejected for all significance levels which are larger than $p$.
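
As a simple illustration of the tail area calculation, consider the following
sketch. It assumes a setting of our own choosing: observations from a normal
distribution with unit variance, the simple null hypothesis mu = mu0, and the
statistic t = sqrt(n)(xbar - mu0), which is standard normal under the null. The
data and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean

def tail_area(xs, mu0=0.0):
    """p-value for H0: mu = mu0 with x_i ~ N(mu, 1), using t = sqrt(n)(xbar - mu0).

    Under H0, t has a standard normal distribution, so p = Pr{t > t(x) | H0}.
    """
    t_obs = sqrt(len(xs)) * (fmean(xs) - mu0)
    return 1 - NormalDist().cdf(t_obs)

xs = [0.8, 1.3, -0.2, 0.9, 1.1]   # hypothetical observations
p = tail_area(xs)
for level in (0.01, 0.05, 0.10):
    verdict = "rejected" if p <= level else "not rejected"
    print(f"p = {p:.4f}: H0 {verdict} at level {level}")
```

The loop makes explicit that quoting p is equivalent to reporting the outcome
of the test at every significance level simultaneously.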

Comparison with the Bayesian analogue summarised above prompts the fol- 
lowing remarks. 


(i) The frequentist theory does not generally offer any guidance on the choice of
an appropriate test statistic (the generalised likelihood ratio test, a disguised
Bayes factor, seems to be the only proposal). While in the Bayesian analysis
$t(x)$ is naturally and constructively derived as an expected measure of
discrepancy, the frequentist statistician must, in general, rely on intuition to select $t$.
The absence of declared alternatives even precludes the use of the frequentist
optimality criteria used in hypothesis testing.

(ii) Even if a function $t = t(x)$ is found which may be regarded as a sensible
discrepancy measure, the frequentist statistician needs to determine the
unconditional sampling distribution of $t$ under $H_0$; this may be very difficult,
and actually impossible when there are nuisance parameters. Moreover, in
the more interesting situation of composite null hypotheses, it is required that
$p(t \mid \theta)$ be the same for all $\theta$ in $\Theta_0$, which, often, is simply not the case.

(iii) If a measure of the strength of evidence against $H_0$ is all that is required,
the position of the observed value of $t$ with respect to its posterior predictive
distribution $p(t \mid x, H_0)$ under the null hypothesis seems a more reasonable,
more relevant answer than quoting the realised p-value. Indeed, the
compatibility of $t(x)$ with $H_0$ may be described by quoting the HPD intervals to
which it belongs, or may be measured with any proper scoring rule such as
$A \log p(t(x) \mid x, H_0) + B$. Thus, in Figure B.3, $t_1(x)$ may readily be accepted
as compatible with $H_0$ while $t_2(x)$ may not.


Figure B.3. Visualising the compatibility of $t(x)$ with $H_0$
(the density $p(t(x) \mid x, H_0)$, with $t_1(x)$ and $t_2(x)$ marked on the horizontal axis)


(iv) If a decision on whether or not to reject $H_0$ has to be made, this should certainly
take into account the advantages of keeping $H_0$, i.e., defining the cut-off point
in terms of utility. We described in Section 6.2 how this may actually be
chosen to guarantee a specified significance level, but this is only one possible
choice, not necessarily the most appropriate in all circumstances.

We should finally point out that most of the criticisms already made about
hypothesis testing are equally applicable to significance testing. Similarly,
criticisms made of confidence intervals typically apply to significance testing, since
confidence intervals can generally be thought of as consisting of those null values
which are not rejected under a significance test.


B.4 COMPARATIVE ISSUES 


B.4.1 Conditional and Unconditional Inference 


At numerous different points of this Appendix we have emphasised the following
essential difference between Bayesian and frequentist statistics: Bayesian statistics
directly produces statements about the uncertainty of unknown quantities, either
parameters or future observations, conditional on known data; frequentist statistics
produces probability statements about hypothetical repetitions of the data
conditional on the unknown parameter, and then seeks (indirectly) ways of making this
relevant to inferences about the unknown parameters given the observed data.
Indeed, the problem at the very heart of the frequentist approach to statistics is that of
connecting aggregate, long-run sampling properties under hypothetical repetitions
to specific inferences of a totally different type. Not only may one dispute the
existence of the conceptual “collective” where these hypothetical repetitions might take
place, but the relevance of the aggregate, long-run properties for specific inference
problems seems, at best, only tangential.

It is useful to distinguish between two very different concepts, initial precision
and final precision, introduced by Savage (1962). Thus, frequentist procedures are
designed in terms of their expected behaviour over the sample space: they typically
have average characteristics which describe, for each value of the unknown
parameters, the “precision” we may initially expect, before the data are collected. Thus,
for example, one might expect that, in the long run, the true mean $\mu$ will be included
in 95% of the intervals of the form $\bar{x} \pm 1.96/\sqrt{n}$ which might be constructed by
repeated sampling from a normal distribution with known unit precision.

A far more pertinent question, however, is the following: given the observed
$\bar{x}$ which derives from the observed sample (which typically will not be repeated in
any actual practice), how close is the unknown $\mu$ to the observed $\bar{x}$? Within the
frequentist approach, one must rest on the rather dubious “transferred” properties
of the long-run behaviour of the procedure, with no logical possibility of assessing
the relevant final precision.

Thus, p-values or confidence intervals are largely irrelevant once the sample
has been observed, since they are concerned with events which might have occurred,
but have not. Indeed, to quote Jeffreys (1939/1961, p. 385)

... a hypothesis which may be true may be rejected because it has not predicted
observable results which have not occurred. This seems a remarkable procedure.


The following example, taken from Welch (1939), further illustrates the
difference between initial and final precision.

Example B.10. (Initial and final precision). Let $x = \{x_1, \ldots, x_n\}$ be a random
sample from a uniform distribution over $]\mu - \frac{1}{2}, \mu + \frac{1}{2}[$. It is easily verified that the midrange
$\hat{\mu} = (x_{\min} + x_{\max})/2$ is a very efficient estimator with a sampling variance of the order of
$1/n^2$, rather than the usual $1/n$, so that, from large samples, we may expect, on average,
very precise estimates of $\mu$. Suppose however that we obtain a specific large sample with a
small range; this is, admittedly, unlikely, but nevertheless possible. Given the sample and
using a uniform (reference) prior for $\mu$, we can only really claim that
$\mu \in\ ]x_{\max} - \frac{1}{2}, x_{\min} + \frac{1}{2}[$
(since the reference posterior distribution is uniform on that interval). Thus, if the actual
data turn out this way, the final precision of our inferences about $\mu$ is bound to be rather
poor, no matter how efficient the estimator $\hat{\mu}$ was expected to be.
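
The contrast in Example B.10 may be sketched numerically as follows. The true
value mu = 3, the sample size, the seed and, in particular, the artificially
constructed small-range sample are all assumptions made purely for illustration;
the function name is hypothetical.

```python
import random

def midrange_and_posterior_support(xs):
    """Return the midrange and the interval ]x_max - 1/2, x_min + 1/2[ on which
    the posterior for mu (uniform prior) is uniform."""
    lo, hi = min(xs), max(xs)
    return (lo + hi) / 2, (hi - 0.5, lo + 0.5)

rng = random.Random(3)
mu = 3.0
# A typical large sample from U(mu - 1/2, mu + 1/2).
typical = [rng.uniform(mu - 0.5, mu + 0.5) for _ in range(1000)]
# A possible but unlikely sample whose values all lie within 0.01 of mu.
unlucky = [mu + 0.02 * rng.uniform(-0.5, 0.5) for _ in range(1000)]

for label, xs in (("typical", typical), ("small range", unlucky)):
    est, (a, b) = midrange_and_posterior_support(xs)
    print(f"{label}: midrange {est:.4f}, posterior support width {b - a:.4f}")
```

Both samples have the same size, and hence the same initial precision, yet the
width of the posterior support, which is one minus the sample range, is tiny for
the typical sample and close to one for the small-range sample.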


The need for conditioning on observed data can be partially met in frequentist 
procedures by conditioning on an ancillary statistic. Indeed, we saw in Examples 
B.2 and B.3 that it is easy to construct examples where totally unconditional proce- 
dures produce ludicrous results. However, as pointed out in our discussion of the 
conditionality principle, there remain many problems with conditioning on
ancillary statistics: they are not easily identifiable, they are not necessarily unique and,
moreover, conditioning on an ancillary statistic can yield a totally uninformative
sampling distribution, and can conflict with other frequentist desiderata, such as
the search for maximum power in hypothesis testing; see, for example, Basu (1964,
1992) and Cox and Reid (1987). See, also, Berger (1984b).


B.4.2 Nuisance Parameters and Marginalisation 


Most realistic probability models make the sampling distribution dependent not only
on the unknown quantity of primary interest but also on some other parameters.
Thus, the full parameter vector $\theta$ can typically be partitioned into $\theta = \{\phi, \lambda\}$
where $\phi$ is the subvector of interest and $\lambda$ is the complementary subvector of $\theta$,
often referred to as the vector of nuisance parameters.

We recall from Section 5.1 that, within a Bayesian framework, the presence 
of nuisance parameters does not pose any formal, theoretical problems. Indeed, 
the desired result, namely the (marginal) posterior distribution of the parameter of 
interest, can simply be written as 


$$p(\phi \mid x) = \int p(\phi, \lambda \mid x)\, d\lambda,$$

where the full posterior $p(\phi, \lambda \mid x)$ is directly obtained from Bayes’ theorem.
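
The marginalisation step can be sketched numerically on a grid. The example
below assumes, for illustration only, normal data with mean mu (the parameter
of interest) and precision lam (the nuisance parameter), together with a prior
proportional to 1/lam; the data, the grids and the function name are
hypothetical, and a crude Riemann sum stands in for the integral.

```python
from math import exp, log

def marginal_posterior_of_mean(xs, mu_grid, lam_grid):
    """Grid approximation to p(mu | x) = integral of p(mu, lam | x) d lam.

    Model: x_i ~ N(mu, precision lam); prior proportional to 1/lam (an
    assumption of this sketch). Returns probabilities over mu_grid.
    """
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    weights = []
    for mu in mu_grid:
        q = ss + n * (mu - xbar) ** 2      # equals the sum of (x_i - mu)^2
        # joint posterior kernel lam^(n/2 - 1) exp(-lam q / 2), summed over lam
        weights.append(sum(exp((n / 2 - 1) * log(lam) - lam * q / 2)
                           for lam in lam_grid))
    total = sum(weights)
    return [w / total for w in weights]

xs = [4.8, 5.1, 5.3, 4.9, 5.4]                  # hypothetical data
mu_grid = [4.0 + 0.01 * k for k in range(221)]  # 4.00, 4.01, ..., 6.20
lam_grid = [0.05 * k for k in range(1, 2001)]   # 0.05, 0.10, ..., 100
post = marginal_posterior_of_mean(xs, mu_grid, lam_grid)
mode = mu_grid[post.index(max(post))]
print(f"posterior mode of mu on the grid: {mode:.2f}")
```

No special device is needed to remove the nuisance parameter: it is simply
summed (integrated) out of the joint posterior.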

The situation is very different from a frequentist point of view. Indeed, the
problem posed by the presence of nuisance parameters is only satisfactorily solved
within a pure frequentist framework in those few cases where the optimality
criterion used leads to a procedure which depends on a statistic whose sampling
distribution does not depend on the nuisance parameter. Frequentist inferences about
the mean of a normal distribution with unknown variance based on the Student-t
statistic, whose sampling distribution does not involve the variance, provide the
best known example. In general, frequentists are forced to use approximate
methods, typically based on asymptotic theory. Indeed, some statisticians see this as the
main motivation for developing asymptotic results:


... a serious difficulty is that the techniques ... for problems with nuisance
parameters are of fairly restricted applicability. It is, therefore, essential to have
widely applicable procedures that in some sense provide good approximations
when “exact” solutions are not available. ... the central idea being that when
the number n of observations is large and errors of estimation correspondingly
small, simplifications become available that are not available in general (Cox and
Hinkley, 1974, p. 279, our italics)


However, even the domain of “fairly restricted applicability” resulting from
reliance on asymptotic methods can be problematic. In an early paper, Neyman and
Scott (1948) illustrated such problems by considering models with many nuisance
parameters of the type

$$p(x \mid \phi, \lambda) = \prod_{i=1}^{n} p(x_i \mid \phi, \lambda_i),$$

where a new nuisance parameter $\lambda_i$ is introduced with each observation. Note
that such models are not unrealistic: for example, $x_i$ may be a physiological
measurement on individual $i$, which may have a normal distribution with mean $\lambda_i$ and
common variance $\phi$, the latter being the parameter of interest. Kiefer and
Wolfowitz (1956) and Cox (1975) proposed solutions for this type of problem based
on treating the $\lambda_i$'s as independent observations from some distribution. From a
Bayesian viewpoint, this, of course, then becomes a case of hierarchical modelling
as discussed in Section 4.6.5.
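
A small simulation indicates how badly maximum likelihood can behave in models
of this type. The sketch assumes the version of the problem usually quoted in
the literature, which is not spelled out above: two replicate normal
measurements per individual, with individual means lambda_i and common variance
phi. The distribution used to generate the lambda_i's, the value phi = 4 and
the function name are arbitrary choices of ours.

```python
import random

def neyman_scott_mle(n_individuals, phi=4.0, seed=2):
    """Maximum likelihood estimate of the common variance phi when each
    individual i contributes two N(lambda_i, phi) measurements."""
    rng = random.Random(seed)
    sd = phi ** 0.5
    total = 0.0
    for _ in range(n_individuals):
        lam = rng.uniform(-10, 10)       # arbitrary nuisance mean for individual i
        x1 = rng.gauss(lam, sd)
        x2 = rng.gauss(lam, sd)
        xbar = (x1 + x2) / 2             # maximum likelihood estimate of lambda_i
        total += (x1 - xbar) ** 2 + (x2 - xbar) ** 2
    return total / (2 * n_individuals)   # maximum likelihood estimate of phi

for n in (100, 1_000, 20_000):
    print(f"n = {n}: estimate of phi = {neyman_scott_mle(n):.3f} (true value 4.0)")
```

As the number of individuals grows, the estimate settles near phi/2 rather than
phi: the number of nuisance parameters grows with the sample, and standard
asymptotic reassurances no longer apply.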

The only general alternative strategy which has been proposed to avoid resorting
to asymptotics when exact methods are not available is to use a modified form
of likelihood (estimated, conditional, or marginal), in which the dependence on the
nuisance parameters has been reduced or eliminated.

An estimated likelihood is obtained by replacing nuisance parameters by (for
example) their maximum likelihood estimates. This procedure does not take account
of the uncertainty due to the lack of knowledge about the nuisance parameters,
and may be misleading both in the precision and in the location associated with
inferences about the parameters of interest. For example, in linear regression with
many regressors, substitution of the regression coefficients by their maximum
likelihood estimates leads to an estimate of the variance which is misleadingly precise.


Marginal and conditional likelihoods are based on breaking the likelihood
function into two factors, either using invariance arguments or conditioning on 
sufficient statistics for the nuisance parameters. In both cases, one factor provides 
a likelihood function for the parameter of interest while the other is assumed “to 
contain no information about the parameter of interest in the absence of knowledge 
about the nuisance parameter”. Key references to this approach are Kalbfleisch and
Sprott (1970, 1973) and Andersen (1970, 1973). 


There are, however, two main problems with this type of approach.


(i) They are not general and can only be applied under rather specific circum- 
stances. 


(ii) They critically depend on the highly controversial notion of a “function not 
containing relevant information in the absence of knowledge about the nui- 
sance parameters”, for which no operational definition has ever been provided. 


In the cases where the techniques can be applied, and a consensus seems to 
exist about this vague information condition, the resulting forms tend to coincide, 
as one might expect, with the integrated likelihood 


\[ \int p(x \mid \phi, \lambda)\, \pi(\lambda \mid \phi)\, d\lambda, \]


integrated with respect to the conditional reference prior distribution $\pi(\lambda \mid \phi)$ of
the nuisance parameters given the parameter of interest.


Profile likelihood provides a much more refined version of this approach, which 
often gives answers which closely correspond to Bayesian marginalisation results; 
the Fieller-Creasy problem concerning the ratio of normal means provides a typical 
example (see Bernardo, 1977). For further discussion and extensive references, see, 
for example, Barndorff-Nielsen (1983, 1991), Cox and Reid (1987), Cox (1988) and
Fraser and Reid (1989). Another suggestion, closely related to fiducial inference,
is the implied likelihood (Efron, 1993).


Liseo (1993) shows that reference posterior credible regions have better fre- 
quentist coverage properties than those obtained from likelihood methods. For a 
Bayesian overview of methods for treating nuisance parameters, see Basu (1977), 
Dawid (1980a), Willing (1988) and Albert (1989). 


B.4.3 Approaches to Prediction


The general problem of statistical prediction may be described as that of inferring 
the values of unknown observable variables from current available information. 
Thus, from data $x$, usually a random sample $\{x_1, \ldots, x_n\}$, inference statements
are desired about, as yet, unobserved data $y$, often $x_{n+1}$ (the original problem
considered by Bayes, 1763, in a binomial setting).

We recall from Section 5.1.3 that, from a Bayesian point of view, with an 
operationalist concern with modelling uncertainty in terms of observables, Bayes’ 
theorem, in its central role as a coherent learning process about parameters, is just 
a convenient step in the process of passing from 


\[ p(x) = \int \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta \]


to

\[ p(y \mid x) = \int p(y \mid \theta)\, p(\theta \mid x)\, d\theta \]


by means of $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$. Since any valid coherent inferential
statement about $y$ given $x$ is contained in the posterior predictive distribution $p(y \mid x)$,
no special theory has to be developed. Of course, the inferential content of the
predictive distribution may be appropriately summarised by location or spread
measures, respectively providing “estimators” of $y$, such as the mean and the mode
of $p(y \mid x)$, or “interval estimates” of $y$, such as the class of HPD intervals which
may be derived from $p(y \mid x)$. Moreover, if one is faced with a decision problem
whose utility function $u(a, y)$ involves a future observable, then $p(y \mid x)$ becomes
the necessary ingredient in determining the optimal action, $a^*$, which maximises
the appropriate (posterior) expected utility


\[ \bar{u}(a \mid x) = \int u(a, y)\, p(y \mid x)\, dy. \]
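
As a small illustration of these steps in the binomial setting mentioned above, the following sketch assumes a Beta(a, b) prior on the success probability (uniform by default) and a finite set of actions with utilities depending only on whether the next observation is a success. The function names, the action labels and the utility values in the usage note are all hypothetical.

```python
from fractions import Fraction


def predictive_success(successes, n, a=1, b=1):
    """P(x_{n+1} = 1 | x) for Bernoulli data under an assumed Beta(a, b) prior."""
    return Fraction(a + successes, a + b + n)


def best_action(utilities, p_success):
    """Return the action maximising expected utility under the predictive.

    utilities maps each action to the pair (u(a, y=1), u(a, y=0)).
    """
    expected = {
        action: u1 * p_success + u0 * (1 - p_success)
        for action, (u1, u0) in utilities.items()
    }
    return max(expected, key=expected.get), expected
```

For instance, with 7 successes in 10 trials the predictive probability of a further success is 2/3, so that for a bet paying 1 on a success and losing 3 on a failure, `best_action({"bet": (1, -3), "pass": (0, 0)}, predictive_success(7, 10))` selects "pass", the expected utility of the bet being negative.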


The range of potential applications of these ideas is extensive. 


(i) Density estimation. The action space consists of the class of sampling distri- 
butions; the predictive distribution, which is the posterior expectation of the 
sampling distribution, is, for squared error loss, the optimal estimator of the 
sampling distribution. 

(ii) Calibration. Two observations $(x_{1i}, x_{2i})$ are made on a set of $n$ individuals
using two different measuring procedures, and it is desired to estimate the
measurement $y_2$ that the second procedure would yield on a new individual,
given that the measurement using the first procedure has turned out to be $y_1$.
The solution is a simple exercise in probability calculus leading to the required
posterior predictive density $p(y_2 \mid y_1, x_1, x_2)$.

(iii) Classification. This is a particular case of the problem of calibration where
the $x_{2i}$'s (and $y_2$) can only take on a discrete, usually finite, set of values.

(iv) Regulation. In contexts analogous to (ii) and (iii), it is desired to select and fix a
value of $y_1$ so that $y_2$ is as close as possible to a prescribed value. The solution
is obtained by minimising the expectation of an appropriate loss function with
respect to the predictive distribution $p(y_2 \mid y_1, x_1, x_2)$. The particular case
of optimisation obtains when it is desired to make $y_2$ as large (or small) as
possible.

(v) Model comparison. In a setting with alternative models, the latter may be 
compared in terms of their predictive posterior probabilities (cf. Section 6.1). 

(vi) Model criticism. The compatibility of a given model with observed data may 
be assessed by comparing the realised value of a test statistic with its predictive 
distribution under that model (cf. Section 6.2). 


For further details of the systematic use of predictive ideas, the reader is 
referred, for example, to Roberts (1965), Geisser (1966, 1974, 1980b, 1988) and 
Zellner (1986b). The books by Aitchison and Dunsmore (1975) and Geisser (1993) 
contain a wealth of detailed discussion of prediction problems, including those
involving decision making. Applications of predictive ideas to classification,
calibration, regulation, optimisation and smoothing are found, for instance, in Dunsmore
(1966, 1968, 1969), Bernardo (1988), Racine-Poon (1988), Klein and Press (1992), 
Lavine and West (1992) and Zidek and Weerahandi (1992). See, also, Gelfand and 
Desu (1968) and Amaral-Turkman and Dunsmore (1985). 

It is important to recall here (see Section 5.1) that, by virtue of the representa- 
tion theorems, parameters are limiting forms of observables and, hence, inference 
about parameters may be seen as a limiting form of predictive inference about 
observables. Although in practice it is usually convenient to work via parametric 
models, this point, stressed by de Finetti (1970/1974, 1970/1975), has considerable
theoretical importance. Cifarelli and Regazzini (1982), among others, have con- 
tinued this tradition by trying to develop a completely predictive approach which 
bypasses entirely the use of parametric models. 

We should emphasise again that the all too often adopted naive solution of 
prediction based on the “plug-in estimate” form 


\[ p(y \mid x) = \int p(y \mid \theta)\, p(\theta \mid x)\, d\theta \approx p(y \mid \hat{\theta}), \]


effectively replacing the posterior distribution by a degenerate distribution assigning 
probability one to an estimator of $\theta$, usually the maximum likelihood estimate,
is bound to give misleadingly overprecise inference statements about $y$, since it
effectively ignores the uncertainty about $\theta$. The point is illustrated in detail by
Aitchison and Dunsmore (1975). 
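
The size of the effect is easily seen in the simplest case. The sketch below is an illustration under assumptions of our own choosing, not an example from Aitchison and Dunsmore: normal observations with known standard deviation sigma and a uniform prior on the mean, so that the predictive distribution is normal with variance sigma^2 (1 + 1/n), while the plug-in form has variance sigma^2. The function names are hypothetical.

```python
from statistics import NormalDist


def interval_halfwidths(sigma, n, level=0.95):
    """Half-widths of the central plug-in and predictive intervals for y.

    Assumes normal data with known sigma and a uniform prior on the mean.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    plug_in = z * sigma
    predictive = z * sigma * (1 + 1 / n) ** 0.5
    return plug_in, predictive


def plug_in_coverage(n, level=0.95):
    """Predictive probability actually carried by the nominal plug-in interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return 2 * NormalDist().cdf(z / (1 + 1 / n) ** 0.5) - 1
```

With n = 5, the nominal 95% plug-in interval carries a predictive probability of only about 0.93. The shortfall is considerably larger when the variance is also unknown, a case not covered by this sketch.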

By comparison, the possibilities for frequentist-based prediction are fairly limited.
They are essentially restricted to producing tolerance regions, $R(x)$, designed to
guarantee that, in the long run, a proportion $p$ of possible samples $x$ would produce
regions $R(x)$ such that


\[ \Pr[\, y \in R(x) \mid \theta \,] = 1 - \alpha, \quad \text{for all } \theta \in \Theta, \]


i.e., regions which, for all parameter values, will contain a proportion $1 - \alpha$ of
future observations. If this sounds obscure, particularly in comparison with the 
simple idea of an HPD region from the predictive distribution p(y |x), we can but 
agree! Moreover: 


(i) In order to construct a tolerance region it is essential to find a function of y and 
x with a sampling distribution which does not involve $\theta$, something which is
typically only possible in very simple stylised problems. 

(ii) The difficulties of “transferring” the long-run aggregate properties of confidence
intervals into inference statements conditional on the observed data are
even more acute in a tolerance region setting.
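
Point (i) can be made concrete in one of the few cases where such a function exists. The sketch below assumes normal observations with known standard deviation sigma, for which the difference between y and the sample mean, divided by sigma (1 + 1/n)^{1/2}, is standard normal whatever the value of the unknown mean. The resulting region satisfies the displayed long-run condition, which the simulation checks. Function names, default settings and the seed are illustrative choices.

```python
import random
from statistics import NormalDist, fmean


def pivot_interval(xs, sigma, alpha=0.05):
    """Region R(x) built from the pivot (y - mean(x)) / (sigma * sqrt(1 + 1/n))."""
    n = len(xs)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sigma * (1 + 1 / n) ** 0.5
    centre = fmean(xs)
    return centre - half, centre + half


def long_run_coverage(theta, sigma, n, reps=20000, alpha=0.05, seed=1):
    """Relative frequency with which a future y falls in R(x) over repeated samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(theta, sigma) for _ in range(n)]
        lo, hi = pivot_interval(xs, sigma, alpha)
        y = rng.gauss(theta, sigma)
        hits += lo <= y <= hi
    return hits / reps
```

The relative frequency returned is close to 0.95 for any choice of `theta`. In this particular model the region coincides numerically with the HPD region of the predictive distribution under a uniform prior on the mean, although the two carry different interpretations.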


Descriptions of the frequentist approach to prediction are given in Cox (1975), 
Mathiasen (1979) and Barndorff-Nielsen (1980); Guttman (1970) provides a com- 
parison between frequentist tolerance regions and HPD regions from predictive 
distributions. 

Kalbfleisch (1971) was one of the first to examine likelihood methods for prediction.
Essentially, with $t$ denoting a sufficient statistic for $\theta$, he proposed computing
a predictive distribution of the form


\[ p(y \mid t) = \int p(y \mid \theta)\, f(\theta \mid t)\, d\theta \]


whenever a fiducial distribution for $\theta$, $f(\theta \mid t)$, can be derived from the sampling
distribution of $t$. Of course, the method is not always applicable; moreover, even
when it is, it may lead to inconsistent results when the fiducial distribution is not a
Bayes’ posterior. For instance, in the discussion which follows Kalbfleisch’s paper,
Lindley points out that if the model is


\[ p(x \mid \theta) = \frac{\theta^2}{\theta + 1}\,(x + 1)\,e^{-\theta x}, \qquad x > 0,\ \theta > 0, \]

and the method is applied both to obtain directly $p(x_{n+1} \mid x_1, \ldots, x_n)$ and to obtain
$p(x_{n+1}, x_{n+2} \mid x_1, \ldots, x_n)$ and then $p(x_{n+1} \mid x_1, \ldots, x_n)$ from the joint predictive,
one obtains different answers. This is an interesting example of the fact that fiducial
distributions do not necessarily have basic coherence properties unless they are 
equivalent to Bayesian posterior distributions. 

Since the late 1970's a variety of more sophisticated “likelihood prediction”
methods have been proposed, some sufficiency-based, others relating to profile 
likelihood ideas. Seminal contributions include those by Hinkley (1979), Lejeune 
and Faulkenberry (1982), Butler (1986) and Lane and Sudderth (1989). A review 
is given by Bjørnstad (1990), and a further overview is provided by Geisser (1993).

A more radical approach to prediction is taken by Dawid (1984), who sets
out a theory of prequential analysis. This is closely related to our view that a 
model or theory is simply a probability forecasting system, but Dawid’s theory is 
not predicated on such a system necessarily being Bayesian. Instead, the basic 
ingredients are simply two sequences; one a string of observations, the other a 
string of probability forecasts. Theoretical developments requiring an extension 
of the standard Kolmogorov (1933/1950) framework for probability are pursued 
in Vovk (1993a). See, also, Vovk (1993b) and Vovk and Vyugin (1993). Links 
with stochastic complexity (Solomonoff, 1978; Rissanen, 1987, 1989; Wallace and 
Freeman, 1987) are reviewed in Dawid (1992). 


B.4.4 Aspects of Asymptotics 


In most statistical problems, a number of simplifications become available when 
the sample size becomes sufficiently large. In frequentist statistics, this is often 
the only way to obtain analytic results. From a Bayesian point of view, such 
simplifications are never theoretically necessary, although, of course, they often 
make computations easier and sometimes provide valuable analytic insight. 

We recall from Section 5.3 that, as the sample size increases, the posterior 
distribution of the parameter of interest $\theta$ converges to a degenerate distribution
which gives probability one to the true parameter value when the parameter of
interest is discrete and, under suitable regularity conditions, when the parameter of
interest is continuous, converges to a normal distribution $N(\theta \mid \hat{\theta}_n, H(\hat{\theta}_n))$, with
precision matrix

\[ H(\hat{\theta}_n) = \left( -\frac{\partial^2 \log p(x \mid \theta)}{\partial\theta_i\, \partial\theta_j} \right) \bigg|_{\theta = \hat{\theta}_n}. \]
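
As a rough numerical check of this approximation, the following sketch assumes Bernoulli observations with a uniform prior, so that the exact posterior is Beta(s + 1, n - s + 1), and compares its standard deviation with that implied by the precision evaluated at the maximum likelihood estimate, which in this model equals n divided by theta_hat (1 - theta_hat). The model and the function names are our own illustrative choices.

```python
import math


def beta_posterior_sd(s, n):
    """Exact posterior standard deviation under an assumed uniform prior."""
    a, b = s + 1, n - s + 1
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


def normal_approx_sd(s, n):
    """Standard deviation implied by the asymptotic normal form (requires 0 < s < n)."""
    theta_hat = s / n
    # Minus the second derivative of the log-likelihood, evaluated at theta_hat
    precision = n / (theta_hat * (1 - theta_hat))
    return 1 / math.sqrt(precision)
```

With s = 300 and n = 1000 the two standard deviations differ by well under one per cent, whereas with s = 3 and n = 10 the discrepancy is of the order of ten per cent.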


The most frequently used asymptotic results in frequentist statistics concern
the large sample behaviour of the maximum likelihood estimate $\hat{\theta}_n$, which, under
suitable regularity conditions (mathematically usually closely related to those
required to guarantee posterior asymptotic normality), may be shown to have an
asymptotically normal sampling distribution $N(\hat{\theta}_n \mid \theta, nI(\theta))$, with precision
matrix whose general element is

\[ (I(\theta))_{ij} = \int p(x \mid \theta) \left( -\frac{\partial^2 \log p(x \mid \theta)}{\partial\theta_i\, \partial\theta_j} \right) dx. \]


For details, see, for example, LeCam (1956, 1970, 1986), and references therein. 

Since it is easily established that, for large $n$, $H(\hat{\theta}_n)$ converges to $nI(\theta)$, and
since, asymptotically, the sampling distribution of $\hat{\theta}_n$ becomes a location model for
$\theta$, it follows (Lindley, 1958) that the reference posterior distribution for $\theta$ and the
asymptotic fiducial distribution of $\theta$ based on the sampling distribution of $\hat{\theta}_n$ are
asymptotically equivalent. Moreover, the maximum likelihood estimator of $\theta$ and
the asymptotic confidence intervals based on $\hat{\theta}_n$ will be, respectively, numerically
identical to the mode (or the mean) and the HPD intervals based on the reference 
posterior distribution (or any other posterior distribution based on a reasonably 
well-behaved prior). 

These results explain the fact that, for large samples (relative to the dimension- 
ality of the parametric model component) there are typically very few numerical dif- 
ferences between Bayesian inferential statements and frequentist statements based 
on asymptotic properties. This asymptotic equivalence carries over, of course, to a 
number of applications. For example: 


(i) We showed (Corollary 2 to Proposition 5.17) that if $\theta$ is asymptotically
normal $N(\theta \mid \hat{\theta}_n, H(\hat{\theta}_n))$ then, under appropriate regularity conditions, $g(\theta)$ is
asymptotically normal


\[ N\big( g(\theta) \mid g(\hat{\theta}_n),\ H(\hat{\theta}_n)\{g'(\hat{\theta}_n)\}^{-2} \big). \]


The frequentist equivalent (typically derived using the delta method for
determining the asymptotic distribution of an estimator) is that if $\hat{\theta}_n$ has an
asymptotic sampling distribution $N(\hat{\theta}_n \mid \theta, nI(\theta))$, then $g(\hat{\theta}_n)$ has an
asymptotic sampling distribution


\[ N\big( g(\hat{\theta}_n) \mid g(\theta),\ nI(\theta)\{g'(\theta)\}^{-2} \big). \]


(ii) The predictive distribution $p(y \mid x)$ is asymptotically approached by $p(y \mid \hat{\theta}_n)$.
(iii) The action which maximises the posterior expected utility is, asymptotically,
the same as that which maximises $u(a, \hat{\theta}_n)$.
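
Item (i) may be checked numerically. The sketch below again assumes Bernoulli data with a uniform prior and takes g to be the log-odds, both our own illustrative choices. It compares the standard deviation given by the asymptotic precision, namely the precision for theta multiplied by the inverse square of the derivative of g, with a Monte Carlo estimate obtained by transforming draws from the exact Beta posterior. Function names and the seed are hypothetical.

```python
import math
import random
import statistics


def logit(t):
    return math.log(t / (1 - t))


def delta_method_sd(s, n):
    """Asymptotic standard deviation of g(theta) = logit(theta), requiring 0 < s < n."""
    theta_hat = s / n
    precision_theta = n / (theta_hat * (1 - theta_hat))
    g_prime = 1 / (theta_hat * (1 - theta_hat))  # derivative of the log-odds
    precision_g = precision_theta * g_prime ** -2
    return 1 / math.sqrt(precision_g)


def monte_carlo_sd(s, n, draws=20000, seed=2):
    """Standard deviation of logit(theta) over draws from the Beta posterior."""
    rng = random.Random(seed)
    values = [logit(rng.betavariate(s + 1, n - s + 1)) for _ in range(draws)]
    return statistics.pstdev(values)
```

For s = 300 and n = 1000 the two values agree to within Monte Carlo error. The same arithmetic, with the maximum likelihood estimate and the true parameter exchanging roles, gives the frequentist delta method result.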


In the fictional world of unlimited data, numerical differences between fre- 
quentist and Bayesian solutions would tend to disappear with increasing sample 
size although, even then, differences in interpretation would persist. However, in 
the real world of limited data relative to the (often multiparameter) models required 
for realism, there is no reason to expect, in general, close coincidence of numerical 
solutions. 


B.4.5 Model Choice Criteria 


We have discussed earlier, in Sections B.3.3 and B.3.4, the hypothesis and significance
testing approaches to parametric hypotheses, but have noted that, in general,
no satisfactory exact procedures exist. This may be because of a lack of simplification
via sufficiency or invariance arguments, resulting in intractable distributions,
or because a procedure cannot be found which has uniformly optimal properties 
through the range of parameter values under the alternative hypothesis. 

A procedure frequently adopted in such situations is the so-called general 
maximum likelihood ratio test, which we describe first for the case of a simple null 
hypothesis, $\theta = \theta_0 \in \mathbb{R}^k$, and observations $x = (x_1, \ldots, x_n)$. The procedure is
motivated by considering the ratio 


\[ r(x) = \frac{p(x \mid \theta_0)}{p(x \mid \hat{\theta})}, \]


where $\hat{\theta}$ is the maximum likelihood estimate. Intuitively, small values of $r(x)$
suggest rejection of the null hypothesis, but using this type of test requires deriving
the distribution of $r(x)$, which is, in general, not possible. However, a simple
asymptotic argument (see, for example, Cox and Hinkley, 1974, Section 9.3) reveals
that, for suitable regularity conditions, under the null hypothesis, as $n \to \infty$,
$\lambda(x) = -2 \log r(x)$ has a limiting $\chi^2_k$ distribution.
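
For a one-parameter illustration, the following sketch computes the statistic $-2 \log r(x)$ for Bernoulli data with a simple null value, together with the tail area of the limiting chi-squared distribution with one degree of freedom. The Bernoulli model and the function names are assumptions made for the example. The tail area uses the identity that a chi-squared variable with one degree of freedom is the square of a standard normal variable.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def lr_statistic(s, n, theta0):
    """-2 log r(x) for s successes in n Bernoulli trials, with 0 < theta0 < 1."""
    theta_hat = s / n
    log_r = (_xlogy(s, theta0) + _xlogy(n - s, 1 - theta0)
             - _xlogy(s, theta_hat) - _xlogy(n - s, 1 - theta_hat))
    return max(-2.0 * log_r, 0.0)  # guard against tiny negative rounding error


def chi2_1_tail(lam):
    """P(chi-squared with 1 d.f. > lam), via the standard normal tail."""
    return math.erfc(math.sqrt(lam / 2.0))
```

With 60 successes in 100 trials and a null value of 0.5, the statistic is just over 4 and the asymptotic tail area lies between 0.04 and 0.05.
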

The procedure is easily extended to the case of a composite null hypothesis
$\theta \in \Theta_0 \subset \mathbb{R}^k$. If the alternative hypothesis is $\theta \in \Theta_1$, and we define $\Theta = \Theta_0 \cup \Theta_1$,
we consider the ratio

\[ r(x) = \frac{\sup_{\theta \in \Theta_0} p(x \mid \theta)}{\sup_{\theta \in \Theta} p(x \mid \theta)}. \]

In this case, asymptotic analysis reveals that, for suitable regularity conditions,
$\lambda(x) = -2 \log r(x)$ has a limiting $\chi^2_d$ distribution, where $d$ is the difference in
dimensionality, $\dim(\Theta) - \dim(\Theta_0)$, of the general and null hypothesis parameter
spaces, respectively.

It is interesting to compare this with a widely used Bayesian form of assessment 
of null and alternative models. Schwarz (1978) shows that, asymptotically, 


\[ -2 \log B_{01} = \lambda(x) - d \log n, \]


where $B_{01}$ is the Bayes factor (Section 6.1.4).

We see, therefore, that the so-called Schwarz criterion for model choice adjusts 
the $-2 \log r(x)$ factor by a $\log n$ multiple of the dimensionality difference.

An earlier proposal for adjusting the general likelihood ratio criterion is that of 
Akaike (1973, 1974, 1978b, 1987), whose so-called Akaike Information Criterion 
(AIC) takes the form 

\[ \mathrm{AIC} = \lambda(x) - 2d. \]


See also, Akaike (1978b, 1979) for a Bayesian extension (BIC) of the AIC proce- 
dure. 

Yet another variant is found in Nelder and Wedderburn (1972), whose suggestion
for goodness-of-fit comparisons of general linear models through plotting
degrees of freedom against deviance is, in effect, the criterion 


\[ \lambda(x) - d. \]


These and other related proposals are reviewed from a Bayesian perspective in 
Smith and Spiegelhalter (1980). See Stone (1977, 1979a) for further discussion 
and comparison. 
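
The three adjustments are easily compared side by side. The sketch below simply evaluates the expressions given above for a supplied value of the statistic $-2 \log r(x)$, dimensionality difference d and sample size n. The function name and dictionary keys are hypothetical, and reading a positive value as favouring the larger model is an interpretation on our part. For the Schwarz form it follows from the fact that $-2 \log B_{01} > 0$ exactly when $B_{01} < 1$.

```python
import math


def model_choice_criteria(lam, d, n):
    """Penalised forms of lam = -2 log r(x) for dimensionality difference d."""
    return {
        "schwarz": lam - d * math.log(n),   # asymptotic form of -2 log B_01
        "akaike": lam - 2 * d,              # AIC form
        "nelder_wedderburn": lam - d,       # deviance against degrees of freedom
    }
```

For example, with a statistic of 4.03, d = 1 and n = 100, the Schwarz form is negative while the other two are positive, so that the criteria need not agree even in very simple problems. The disagreement grows with n, since only the Schwarz penalty depends on the sample size.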

Roughly speaking, the Akaike criterion can be derived from a Bayes factor 
perspective as corresponding to a prior which concentrates on a neighbourhood of 
the alternative which is close, in an appropriate sense depending on $n$, to the null.
The Schwarz criterion is derived from a Bayes factor perspective through a prior
which does not depend on $n$.

Finally, we note that the prequential theory of Dawid (1984) (see also Section
B.4.3) directly embraces the view that models are simply predictive tools and
should be compared on that basis, but does not necessarily use a Bayesian mech- 
anism for such prediction. In Dawid (1992), it is shown that a particular form of 
so-called prequential assessment, based on the logarithmic scoring rule, leads to 
a model choice criterion which is asymptotically equivalent to the Schwarz crite- 
rion. It is also shown that this approach is essentially equivalent to model choice 
procedures arising in the stochastic complexity theory of Rissanen (1987). 